nips nips2005 nips2005-142 knowledge-graph by maker-knowledge-mining

142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games


Source: pdf

Author: Gabriel Y. Weintraub, Lanier Benkard, Benjamin Van Roy

Abstract: We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. We provide conditions that guarantee our method approximates an equilibrium as the number of agents grows. We then derive a performance bound to assess how well the approximation performs for any given number of agents. We apply our method to an important class of problems in applied microeconomics. We show with numerical experiments that we are able to greatly expand the set of economic problems that can be analyzed computationally. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. [sent-3, score-0.216]

2 We provide conditions that guarantee our method approximates an equilibrium as the number of agents grows. [sent-4, score-0.53]

3 At each period of time, each agent has a given state and can make a decision. [sent-9, score-0.494]

4 Additionally, agents receive profits depending on the current states and decisions. [sent-11, score-0.277]

5 When there are more than a few agents participating in the game and/or more than a few states per agent, the curse of dimensionality renders dynamic programming algorithms intractable. [sent-14, score-0.399]

6 In this paper we consider a class of stochastic dynamic games where the state of an agent captures its competitive advantage. [sent-15, score-0.57]

7 Our main motivation is to consider an important class of models in applied economics, namely, dynamic industry models of imperfect competition. [sent-16, score-0.387]

8 To clarify the type of models we consider, let us describe a specific example of a dynamic industry model. [sent-18, score-0.349]

9 Consider an industry where a group of firms can invest to improve the quality of their products over time. [sent-19, score-0.378]

10 In this context, we propose a mean-field approximation approach that dramatically reduces the computational complexity of solving stochastic dynamic games. [sent-24, score-0.216]

11 We propose a simple algorithm for computing an “oblivious” equilibrium in which each agent is assumed to make decisions based only on its own state and knowledge of the long run equilibrium distribution of states, but where agents ignore current information about rivals’ states. [sent-25, score-1.334]

12 We prove that, if the distribution of agents obeys a certain “light-tail” condition, then as the number of agents becomes large the oblivious equilibrium approximates an MPE. [sent-26, score-1.377]

13 We apply our method to analyze dynamic industry models of imperfect competition. [sent-28, score-0.387]

14 The state of each agent captures its ability to compete in the environment. [sent-43, score-0.459]

15 At time t, the state of agent i ∈ S is denoted by xit ∈ N. [sent-44, score-0.559]

16 We define the system state st to be a vector over individual states that specifies, for each state x ∈ N, the number of agents at state x in period t. [sent-45, score-0.838]

17 For each i ∈ S, we define s−i,t ∈ S to be the state of the competitors of agent i; that is, s−i,t (x) = st (x) − 1 if xit = x, and s−i,t (x) = st (x), otherwise. [sent-47, score-0.862]
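
To make this bookkeeping concrete, the short sketch below builds the system state st from a list of individual states and strips out agent i's own contribution to obtain s−i,t. The container and function names are illustrative choices rather than notation from the paper; Python is used only as convenient pseudocode.

    from collections import Counter

    def system_state(individual_states):
        # s_t(x) = number of agents currently at individual state x
        return Counter(individual_states)

    def competitors_state(s_t, x_i):
        # s_{-i,t}(x) = s_t(x) - 1 at agent i's own state x_i, and s_t(x) elsewhere
        s_minus_i = Counter(s_t)
        s_minus_i[x_i] -= 1
        return s_minus_i

    # Example: three agents at states 2, 2, and 5; agent i is at state 2.
    s_t = system_state([2, 2, 5])
    print(competitors_state(s_t, 2))   # Counter({2: 1, 5: 1})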

18 An agent’s single period expected profit πm (xit , s−i,t ) depends on its state xit , its competitors’ state s−i,t and a parameter m ∈ ℝ+ . [sent-49, score-0.48]

19 For example, in the context of an industry model, m could represent the total number of consumers, that is, the size of the pie to be divided among all agents. [sent-50, score-0.292]

20 We interpret this decision as an investment to improve the state at the next period. [sent-54, score-0.26]

21 If an agent invests µit ∈ ℝ+ , then the agent’s state at time t + 1 is given by xi,t+1 = xit + w(µit , ζi,t+1 ), where the function w captures the impact of investment on the state and ζi,t+1 reflects uncertainty in the outcome of investment. [sent-55, score-0.838]
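
A minimal simulation of this transition rule is sketched below. The functional form of w and the distribution of the shock ζ are placeholders chosen for illustration; the paper leaves both abstract.

    import random

    def w(mu, zeta):
        # Hypothetical transition kernel: larger investment mu makes an upward
        # move more likely; zeta is a uniform shock on [0, 1).
        p_up = mu / (1.0 + mu)
        return 1 if zeta < p_up else -1

    def next_state(x_it, mu_it):
        zeta = random.random()                   # uncertainty in the investment outcome
        # x_{i,t+1} = x_{it} + w(mu_{it}, zeta_{i,t+1}), clamped at 0 since states lie in N
        return max(0, x_it + w(mu_it, zeta))

    print(next_state(x_it=3, mu_it=1.5))         # prints 4 or 2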

22 For example, in the context of an industry model, uncertainty may arise due to the risk associated with a research endeavor or a marketing campaign. [sent-56, score-0.267]

23 Hence, the larger the amount invested, the more likely it is that the agent will transition to a better state in the next period. [sent-58, score-0.419]

24 Each agent aims to maximize expected net present value. [sent-64, score-0.356]

25 The equilibrium concept we will use builds on the notion of a Markov perfect equilibrium (MPE), in the sense of [3]. [sent-66, score-0.623]

26 We further assume that equilibrium is symmetric, such that all agents use a common stationary strategy. [sent-67, score-0.53]

27 In particular, there is a function µ such that at each time t, each agent i ∈ S makes a decision µit = µ(xit , s−i,t ). [sent-68, score-0.299]

28 We define the value function V (x, s|µ′, µ) to be the expected net present value for an agent at state x when its competitors’ state is s, given that its competitors each follow a common strategy µ ∈ M, and the agent itself follows strategy µ′ ∈ M. [sent-70, score-1.098]

29 In particular, V (x, s|µ′, µ) = E_{µ′,µ}[ Σ_{k=t}^{∞} β^{k−t} (π(xik , s−i,k ) − dιik ) | xit = x, s−i,t = s ], where i is taken to be the index of an agent at state x at time t, and the subscripts of the expectation indicate the strategy followed by agent i and the strategy followed by its competitors. [sent-71, score-1.022]

30 In an abuse of notation, we will use the shorthand, V (x, s|µ) ≡ V (x, s|µ, µ), to refer to the expected discounted value of profits when agent i follows the same strategy µ as its competitors. [sent-72, score-0.456]
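
The expectation defining V can be approximated by plain Monte Carlo rollout, as in the sketch below. The profit, strategy, and one-period transition functions are assumed inputs standing in for the model primitives; β is the discount factor and d the unit cost of investment.

    def rollout_value(x0, s0, mu_self, profit, step, d=1.0, beta=0.95,
                      horizon=200, n_paths=500):
        # Monte Carlo estimate of V(x, s | mu', mu): the expected discounted sum of
        # pi(x_k, s_{-i,k}) - d * iota_k along simulated paths starting from (x0, s0).
        total = 0.0
        for _ in range(n_paths):
            x, s, value = x0, s0, 0.0
            for k in range(horizon):
                iota = mu_self(x, s)             # agent i's investment decision
                value += beta ** k * (profit(x, s) - d * iota)
                x, s = step(x, s, iota)          # one period: own and rivals' states evolve
            total += value
        return total / n_paths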

31 An equilibrium of our model comprises a strategy µ ∈ M that satisfies sup_{µ′∈M} V (x, s|µ′, µ) = V (x, s|µ) for all states x and competitor states s. [sent-73, score-0.365]

32 Under some technical conditions, one can establish existence of an equilibrium in pure strategies [4]. [sent-75, score-0.383]

33 Dynamic programming algorithms can be used to optimize agent strategies, and equilibria to our model can be computed via their iterative application. [sent-77, score-0.394]

34 In this setting, each agent can potentially make near-optimal decisions based only on its own state and the long run average system state. [sent-81, score-0.555]

35 With this motivation, we consider restricting agent strategies so that each agent’s decisions depend only on the agent’s state. [sent-82, score-0.429]

36 We call such restricted strategies oblivious since they involve decisions made without full knowledge of the circumstances — in particular, the state of the system. [sent-83, score-0.846]

37 Let M̃ ⊂ M denote the set of oblivious strategies. [sent-84, score-0.609]

38 Let s̃_µ̃ be the long-run expected system state when all agents use an oblivious strategy µ̃ ∈ M̃. [sent-86, score-1.094]

39 For an oblivious strategy µ̃′ ∈ M̃ we define an oblivious value function Ṽ (x|µ̃′, µ̃) = E_{µ̃′}[ Σ_{k=t}^{∞} β^{k−t} (π(xik , s̃_µ̃) − dιik ) | xit = x ]. [sent-87, score-1.444]

40 This value function should be interpreted as the expected net present value of an agent that is at state x and follows oblivious strategy µ̃′, under the assumption that its competitors’ state will be s̃_µ̃ for all time. [sent-88, score-1.252]

41 Again, we abuse notation by using Ṽ (x|µ̃) ≡ Ṽ (x|µ̃, µ̃) to refer to the oblivious value function when agent i follows the same strategy µ̃ as its competitors. [sent-89, score-1.04]
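
Because the competitors' state is frozen at s̃_µ̃, evaluating and optimizing the oblivious value function reduces to a single-agent dynamic program over the agent's own state. The sketch below runs value iteration on a finite grid of own states; the profit vector, investment grid, and transition kernel are assumed inputs, and the discretization is an illustrative simplification rather than the paper's procedure.

    import numpy as np

    def oblivious_value_iteration(profit_at_sbar, inv_grid, trans_prob,
                                  d=1.0, beta=0.95, tol=1e-8):
        # Solves  V~(x) = max_iota  pi(x, s~) - d*iota + beta * sum_y P(y | x, iota) V~(y)
        # profit_at_sbar[x] : pi(x, s~) for each own state x on the grid
        # trans_prob(x, iota): length-n probability vector over next own states
        n = len(profit_at_sbar)
        V = np.zeros(n)
        while True:
            Q = np.empty((n, len(inv_grid)))
            for x in range(n):
                for j, iota in enumerate(inv_grid):
                    Q[x, j] = profit_at_sbar[x] - d * iota + beta * trans_prob(x, iota) @ V
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, np.asarray(inv_grid)[Q.argmax(axis=1)]  # values and policy
            V = V_new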

42 We now define a new solution concept: an oblivious equilibrium consists of a strategy µ̃ ∈ M̃ that satisfies sup_{µ̃′∈M̃} Ṽ (x|µ̃′, µ̃) = Ṽ (x|µ̃) for all x. [sent-90, score-0.974]

43 In an oblivious equilibrium, firms optimize an oblivious value function assuming that their competitors’ state will be s̃_µ̃ for all time. [sent-92, score-1.617]

44 It is straightforward to show that an oblivious equilibrium exists under mild technical conditions. [sent-94, score-0.901]

45 With respect to uniqueness, we have been unable to find multiple oblivious equilibria in any of the applied problems we have considered, but, as in the case of MPE, we have no reason to believe that in general there is a unique oblivious equilibrium. [sent-95, score-1.313]

46 4 Asymptotic Results. In this section, we establish asymptotic results that provide conditions under which oblivious equilibria offer close approximations to MPE as the number of agents, n, grows. [sent-96, score-0.748]

47 We consider a sequence of systems indexed by the one period profit parameter m and we assume that the number of agents in system m is given by n(m) = am, for some a > 0. [sent-97, score-0.368]

48 Recall that m represents, for example, the total pie to be divided by the agents so it is reasonable to increase n(m) and m at the same rate. [sent-98, score-0.263]

49 From this point onward we let µ̃(m) denote an oblivious equilibrium for system m. [sent-100, score-0.943]

50 Let V (m) and Ṽ (m) represent the value function and oblivious value function, respectively, when the system is m. [sent-101, score-0.651]

51 The random variable s̃_t(m) denotes the system state at time t when every agent uses strategy µ̃(m). [sent-103, score-0.631]

52 Hence, s̃_t(m) is a stationary process; s̃_t(m) is distributed according to q(m) for all t ≥ 0. [sent-106, score-0.22]

53 It will be helpful to decompose s̃_t(m) according to s̃_t(m) = f̃_t(m) n(m), where f̃_t(m) is the random vector that represents the fraction of agents in each state. [sent-107, score-0.589]

54 Similarly, let f̃(m) ≡ E[f̃_t(m)] denote the expected fraction of agents in each state. [sent-108, score-0.29]
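
In code this decomposition is just a normalization of the state counts by the number of agents; a one-line illustration with assumed names:

    def fraction_vector(s_t, n_agents):
        # f_t(x) = s_t(x) / n : the fraction of the n agents currently at state x
        return {x: count / n_agents for x, count in s_t.items()}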

55 Our aim is to establish that, under certain conditions, oblivious equilibria well-approximate MPE as m grows. [sent-112, score-0.722]

56 A sequence µ̃(m) ∈ M̃ possesses the asymptotic Markov equilibrium (AME) property if, for all x ∈ N, lim_{m→∞} E_{µ̃(m)}[ sup_{µ′∈M} V (m)(x, s̃_t(m) |µ′, µ̃(m)) − V (m)(x, s̃_t(m) |µ̃(m)) ] = 0. [sent-116, score-0.608]
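
Restated in display form for readability (the same content as the sentence above):

    \lim_{m \to \infty} \;
    \mathbb{E}_{\tilde{\mu}^{(m)}}\!\left[
        \sup_{\mu' \in M} V^{(m)}\!\left(x, \tilde{s}^{(m)}_{t} \,\middle|\, \mu', \tilde{\mu}^{(m)}\right)
        - V^{(m)}\!\left(x, \tilde{s}^{(m)}_{t} \,\middle|\, \tilde{\mu}^{(m)}\right)
    \right] = 0
    \qquad \text{for all } x \in \mathbb{N}.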

57 However, recall that each agent’s state reflects its competitive advantage, and if there are agents that are too “dominant” this is not necessarily the case. [sent-120, score-0.66]

58 To make this idea more concrete, let us go back to our industry example where firms invest in quality. [sent-121, score-0.337]

59 Note that d ln πm(y, f, n)/df(x) is the semi-elasticity of one-period profits with respect to the fraction of agents in state x. [sent-124, score-0.48]

60 For each x, g(x) is the maximum rate of relative change of any agent’s single-period profit that could result from a small change in the fraction of agents at state x. [sent-126, score-0.392]

61 Since larger competitors tend to have greater influence on agent profits, g(x) typically increases with x, and can be unbounded. [sent-127, score-0.382]

62 x̃(m) is a random variable with probability mass function f̃(m); it can be interpreted as the state of an agent that is randomly sampled from among all agents while the system state is distributed according to its invariant distribution. [sent-130, score-0.832]

63 Put simply, the light-tail condition requires that states where a small change in the fraction of agents has a large impact on the profits of other agents must have a small probability under the invariant distribution. [sent-135, score-0.36]
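
One natural way to formalize this verbal statement is as a uniform integrability requirement on g evaluated at the randomly sampled agent state x̃(m); the display below is a hedged reconstruction of that idea, not necessarily the paper's exact assumption.

    \lim_{K \to \infty} \; \sup_{m} \;
    \mathbb{E}\!\left[ \, g\!\left(\tilde{x}^{(m)}\right) \,
    \mathbf{1}\!\left\{ g\!\left(\tilde{x}^{(m)}\right) > K \right\} \right] = 0 .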

64 In the previous example of an industry where firms invest in quality this typically means that very large firms (and hence high concentration) rarely occur under the invariant distribution. [sent-136, score-0.417]

65 Under the light-tail condition and some other regularity conditions (in particular, we require that the single-period profit function is “smooth” as a function of its arguments), the sequence µ̃(m) of oblivious equilibria possesses the AME property. [sent-140, score-0.724]

66 5 Error Bounds. While the asymptotic results from Section 4 provide conditions under which the approximation will work well as the number of agents grows, in practice one would also like to know how the approximation performs for a particular system. [sent-141, score-0.431]

67 For that purpose we derive performance bounds on the approximation error that are simple to compute via simulation and can be used to assess the accuracy of the approximation for a particular problem instance. [sent-142, score-0.224]

68 We will quantify the approximation error at each agent state x ∈ N by E[ sup_{µ′∈M} V (x, st |µ′, µ̃) − V (x, st |µ̃) ]. [sent-145, score-0.733]

69 Recall that s̃ is the long-run expected state in oblivious equilibrium (E[st ]). [sent-148, score-1.066]

70 Let ax (y) be the expected discounted sum of an indicator of visits to state y for an agent starting at state x that uses strategy µ . [sent-149, score-0.652]

71 For any oblivious equilibrium µ̃ and state x ∈ N, the following bound (5.1) holds. [sent-152, score-1.008]

72 (5.1) E[∆V ] ≤ (1/(1−β)) E[∆π(st )] + Σ_{y∈N} a_x(y) (π(y, s̃) − E[π(y, st )]). [sent-153, score-0.218]

73 where ∆V = sup_{µ′∈M} V (x, st |µ′, µ̃) − V (x, st |µ̃) and ∆π(s) = max_{y∈N} (π(y, s) − π(y, s̃)). [sent-155, score-0.22]
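
Both terms in the reconstructed bound above are expectations under the oblivious-equilibrium invariant distribution, so they can be estimated by simulating the industry under µ̃, roughly as in the sketch below. The simulate_state, profit, and a_x helpers are assumed inputs, and the formula mirrors the reconstruction above rather than a verified restatement of the paper's theorem.

    def estimate_error_bound(simulate_state, profit, a_x, s_bar, own_states,
                             beta=0.95, n_samples=2000):
        # Monte Carlo estimate of
        #   (1/(1-beta)) * E[max_y (pi(y, s_t) - pi(y, s_bar))]
        #   + sum_y a_x(y) * (pi(y, s_bar) - E[pi(y, s_t)])
        # where s_t is drawn from the oblivious-equilibrium invariant distribution.
        samples = [simulate_state() for _ in range(n_samples)]
        delta_pi = sum(max(profit(y, s) - profit(y, s_bar) for y in own_states)
                       for s in samples) / n_samples
        correction = sum(a_x(y) * (profit(y, s_bar)
                                   - sum(profit(y, s) for s in samples) / n_samples)
                         for y in own_states)
        return delta_pi / (1.0 - beta) + correction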

74 [1] (hereafter EP) introduced an approach to modeling industry dynamics. [sent-160, score-0.267]

75 In this section we use our method to expand the set of dynamic industries that can be analyzed computationally. [sent-163, score-0.223]

76 We consider a model of a single-good industry with quality differentiation. [sent-165, score-0.308]

77 The agents are firms that can invest to improve the quality of their product over time. [sent-166, score-0.349]

78 First, we propose an algorithm to compute an oblivious equilibrium [5]. [sent-174, score-0.373]
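
The paper's algorithm is only cited here [5]; the sketch below shows one natural fixed-point iteration that is consistent with the definition of an oblivious equilibrium, offered as a hedged reconstruction rather than the authors' exact procedure. best_respond and long_run_state are assumed helpers: the first solves the single-agent problem against a fixed long-run state (e.g., by value iteration as sketched earlier), and the second simulates the industry under a given policy and returns the expected long-run system state as a dictionary over states.

    def compute_oblivious_equilibrium(best_respond, long_run_state, s_init,
                                      max_iter=100, tol=1e-6):
        # Iterate: best-respond to the current guess of the long-run state, then
        # refresh that guess by simulating the industry under the new policy.
        s_bar, policy = s_init, None
        for _ in range(max_iter):
            policy = best_respond(s_bar)               # oblivious policy vs. fixed s_bar
            s_next = long_run_state(policy)            # expected long-run state under policy
            states = set(s_next) | set(s_bar)
            if max(abs(s_next.get(x, 0.0) - s_bar.get(x, 0.0)) for x in states) < tol:
                return policy, s_next                  # approximate fixed point reached
            s_bar = s_next
        return policy, s_bar                           # last iterate if not converged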

79 In practice the parameters would either be estimated using data from a particular industry or chosen to reflect an industry under study. [sent-180, score-0.551]

80 For each set of parameters, we use the approximation error bound to compute an upper bound on the percentage error in the value function, E[sup_{µ′∈M} V (x, s|µ′, µ̃) − V (x, s|µ̃)] / E[V (x, s|µ̃)], where µ̃ is the OE strategy and the expectations are taken with respect to s. [sent-183, score-0.373]

81 We compute the previously mentioned percentage approximation error bound for different market sizes m and number of firms n(m) . [sent-185, score-0.286]

82 As the market size increases, the number of firms increases and the approximation error bound decreases. [sent-186, score-0.257]

83 These results show that the approximation error bound may sometimes be small (<2%) in these cases, though this would depend on the model and parameter values for the industry under study. [sent-193, score-0.456]

84 To shed light on this issue we compare long-run statistics for the same industry primitives under oblivious equilibrium and MPE strategies. [sent-195, score-1.168]

85 We compare the average values of several economic statistics of interest under the oblivious equilibrium and the MPE invariant distributions. [sent-197, score-1.021]

86 keeping track of the industry state (the actual difference in the value function relative to E[V (x, s|µ̃)]). Note that the latter quantity should always be smaller than the approximation error bound. [sent-200, score-0.497]

87 When the bound is less than 1% the long-run quantities estimated under oblivious equilibrium and MPE strategies are very close. [sent-202, score-1.078]

88 Performance of the approximation depends on the richness of the equilibrium investment process. [sent-204, score-0.52]

89 Industries with a relatively low cost of investment tend to have a symmetric average distribution over quality levels reflecting a rich investment process. [sent-205, score-0.364]

90 In these cases, when the bound is between 1% and 20%, the long-run quantities estimated under oblivious equilibrium and MPE strategies are still quite close. [sent-206, score-1.062]

91 In industries with high investment cost the industry (system) state tends to be skewed, reflecting low levels of investment. [sent-207, score-0.623]

92 The previous results suggest that MPE dynamics are well-approximated by oblivious equilibrium strategies when the approximation error bound is small (less than 1-2%, and in some cases even up to 20%). [sent-212, score-1.17]

93 Our results demonstrate that the oblivious equilibrium approximation significantly expands the range of applied problems that can be analyzed computationally. [sent-213, score-1.045]

94 As an alternative, we proposed a method for approximating MPE behavior using an oblivious equilibrium, where agents make decisions based only on their own state and the long-run average system state. [sent-216, score-1.103]

95 We also introduced a simple algorithm to compute an oblivious equilibrium. [sent-218, score-0.609]

96 To facilitate using oblivious equilibrium in practice, we derived approximation error bounds that indicate how good the approximation is in any particular problem under study. [sent-219, score-1.125]

97 We used our methods to analyze dynamic industry models of imperfect competition and showed that oblivious equilibrium often yields a good approximation of MPE behavior for industries with a couple hundred firms, and sometimes even with just tens of firms. [sent-221, score-1.538]

98 We have considered very simple strategies that are functions only of an agent’s own state and the long run average system state. [sent-222, score-0.272]

99 Solving for equilibria of this type would be more difficult than solving for oblivious equilibria, but is still likely to be computationally feasible. [sent-225, score-0.704]

100 Foundations of Markov-perfect industry dynamics: Existence, purification, and multiplicity. [sent-248, score-0.267]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('oblivious', 0.609), ('agent', 0.299), ('equilibrium', 0.292), ('industry', 0.267), ('agents', 0.238), ('rms', 0.191), ('mpe', 0.191), ('xit', 0.153), ('investment', 0.153), ('st', 0.11), ('state', 0.107), ('pro', 0.105), ('ame', 0.096), ('industries', 0.096), ('equilibria', 0.095), ('period', 0.088), ('market', 0.084), ('competitors', 0.083), ('dynamic', 0.082), ('approximation', 0.075), ('strategy', 0.073), ('strategies', 0.073), ('invest', 0.07), ('bound', 0.066), ('economic', 0.064), ('decisions', 0.057), ('sup', 0.05), ('ft', 0.044), ('expands', 0.042), ('system', 0.042), ('quality', 0.041), ('states', 0.039), ('invariant', 0.039), ('imperfect', 0.038), ('abuse', 0.038), ('compete', 0.034), ('ts', 0.032), ('net', 0.032), ('error', 0.032), ('advertising', 0.032), ('benkard', 0.032), ('consumers', 0.032), ('deviating', 0.032), ('invested', 0.032), ('weintraub', 0.032), ('percentage', 0.029), ('consumer', 0.028), ('xik', 0.028), ('rm', 0.028), ('analyzed', 0.027), ('stochastic', 0.027), ('fraction', 0.027), ('hundred', 0.026), ('asymptotic', 0.026), ('surplus', 0.025), ('pie', 0.025), ('economics', 0.025), ('expected', 0.025), ('bounds', 0.025), ('uniqueness', 0.024), ('exit', 0.024), ('dynamics', 0.023), ('differentiation', 0.022), ('curse', 0.022), ('quantities', 0.022), ('discounted', 0.021), ('notation', 0.021), ('dominant', 0.02), ('df', 0.02), ('perfect', 0.02), ('ax', 0.02), ('possesses', 0.02), ('games', 0.02), ('contexts', 0.02), ('ik', 0.019), ('tens', 0.019), ('converged', 0.019), ('captures', 0.019), ('concept', 0.019), ('establish', 0.018), ('expand', 0.018), ('game', 0.018), ('couple', 0.018), ('index', 0.018), ('ep', 0.018), ('ecting', 0.018), ('particular', 0.017), ('periods', 0.017), ('grow', 0.017), ('run', 0.017), ('condition', 0.017), ('average', 0.017), ('sometimes', 0.016), ('track', 0.016), ('dramatically', 0.016), ('less', 0.016), ('represents', 0.016), ('propose', 0.016), ('long', 0.016), ('competitive', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

Author: Gabriel Y. Weintraub, Lanier Benkard, Benjamin Van Roy

Abstract: We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. We provide conditions that guarantee our method approximates an equilibrium as the number of agents grows. We then derive a performance bound to assess how well the approximation performs for any given number of agents. We apply our method to an important class of problems in applied microeconomics. We show with numerical experiments that we are able to greatly expand the set of economic problems that can be analyzed computationally. 1

2 0.20240502 36 nips-2005-Bayesian models of human action understanding

Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum

Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.

3 0.17542982 53 nips-2005-Cyclic Equilibria in Markov Games

Author: Martin Zinkevich, Amy Greenwald, Michael L. Littman

Abstract: Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. Instead, we propose an alternative interpretation of the output of value iteration based on a new (non-stationary) equilibrium concept that we call “cyclic equilibria.” We prove that value iteration identifies cyclic equilibria in a class of games in which it fails to find stationary equilibria. We also demonstrate empirically that value iteration finds cyclic equilibria in nearly all examples drawn from a random distribution of Markov games. 1

4 0.16709201 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

5 0.14331999 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

Author: Eddie Rafols, Anna Koop, Richard S. Sutton

Abstract: We present a generalization of temporal-difference networks to include temporally abstract options on the links of the question network. Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment. These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction. The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces. We present empirical examples of our algorithm’s effectiveness and of the greater representational expressiveness of temporallyabstract TD networks.

6 0.13477251 153 nips-2005-Policy-Gradient Methods for Planning

7 0.13117436 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

8 0.062857635 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

9 0.059611674 111 nips-2005-Learning Influence among Interacting Markov Chains

10 0.042625051 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

11 0.041211888 78 nips-2005-From Weighted Classification to Policy Search

12 0.040552836 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

13 0.040522296 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

14 0.040082049 108 nips-2005-Layered Dynamic Textures

15 0.037923999 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains

16 0.037808359 148 nips-2005-Online Discovery and Learning of Predictive State Representations

17 0.037435155 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

18 0.036712468 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks

19 0.036469202 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

20 0.036444549 85 nips-2005-Generalization to Unseen Cases


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.128), (1, -0.017), (2, 0.282), (3, -0.076), (4, -0.015), (5, -0.01), (6, -0.002), (7, 0.021), (8, -0.013), (9, -0.052), (10, -0.021), (11, 0.009), (12, 0.034), (13, 0.039), (14, 0.015), (15, -0.011), (16, 0.022), (17, 0.108), (18, -0.014), (19, 0.18), (20, 0.102), (21, -0.19), (22, 0.07), (23, 0.128), (24, -0.071), (25, -0.234), (26, 0.092), (27, 0.093), (28, 0.085), (29, -0.019), (30, -0.078), (31, -0.132), (32, 0.008), (33, -0.036), (34, -0.013), (35, 0.096), (36, -0.053), (37, 0.039), (38, -0.069), (39, 0.103), (40, 0.102), (41, -0.061), (42, 0.105), (43, 0.029), (44, -0.043), (45, 0.064), (46, 0.059), (47, -0.024), (48, -0.041), (49, -0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94476658 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

Author: Gabriel Y. Weintraub, Lanier Benkard, Benjamin Van Roy

Abstract: We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. We provide conditions that guarantee our method approximates an equilibrium as the number of agents grows. We then derive a performance bound to assess how well the approximation performs for any given number of agents. We apply our method to an important class of problems in applied microeconomics. We show with numerical experiments that we are able to greatly expand the set of economic problems that can be analyzed computationally. 1

2 0.7791453 36 nips-2005-Bayesian models of human action understanding

Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum

Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.

3 0.62619454 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

Author: Deepak Verma, Rajesh P. Rao

Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.

4 0.55665004 53 nips-2005-Cyclic Equilibria in Markov Games

Author: Martin Zinkevich, Amy Greenwald, Michael L. Littman

Abstract: Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. Instead, we propose an alternative interpretation of the output of value iteration based on a new (non-stationary) equilibrium concept that we call “cyclic equilibria.” We prove that value iteration identifies cyclic equilibria in a class of games in which it fails to find stationary equilibria. We also demonstrate empirically that value iteration finds cyclic equilibria in nearly all examples drawn from a random distribution of Markov games. 1

5 0.50517792 111 nips-2005-Learning Influence among Interacting Markov Chains

Author: Dong Zhang, Daniel Gatica-perez, Samy Bengio, Deb Roy

Abstract: We present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a two-level structure: individual-level and group-level. Individual level models actions of each player, and the group-level models actions of the team as a whole. Experiments on synthetic multi-player games and a multi-party meeting corpus show the effectiveness of the proposed model. 1

6 0.48083964 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

7 0.46047679 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

8 0.43856582 153 nips-2005-Policy-Gradient Methods for Planning

9 0.34011853 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

10 0.22946179 85 nips-2005-Generalization to Unseen Cases

11 0.22249427 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning

12 0.21906495 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

13 0.21553364 62 nips-2005-Efficient Estimation of OOMs

14 0.21511038 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks

15 0.21053886 162 nips-2005-Rate Distortion Codes in Sensor Networks: A System-level Analysis

16 0.21013272 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains

17 0.20847906 6 nips-2005-A Connectionist Model for Constructive Modal Reasoning

18 0.20331162 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters

19 0.20224456 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

20 0.19959769 148 nips-2005-Online Discovery and Learning of Predictive State Representations


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.059), (10, 0.063), (11, 0.013), (27, 0.021), (29, 0.014), (31, 0.123), (34, 0.062), (35, 0.321), (39, 0.015), (55, 0.035), (69, 0.043), (73, 0.034), (88, 0.062), (91, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.76724964 121 nips-2005-Location-based activity recognition

Author: Lin Liao, Dieter Fox, Henry Kautz

Abstract: Learning patterns of human behavior from sensor data is extremely important for high-level activity inference. We show how to extract and label a person’s activities and significant places from traces of GPS data. In contrast to existing techniques, our approach simultaneously detects and classifies the significant locations of a person and takes the highlevel context into account. Our system uses relational Markov networks to represent the hierarchical activity model that encodes the complex relations among GPS readings, activities and significant places. We apply FFT-based message passing to perform efficient summation over large numbers of nodes in the networks. We present experiments that show significant improvements over existing techniques. 1
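
As a greatly simplified stand-in for the paper's relational Markov network (which is not implemented here), the sketch below labels a discretized one-dimensional GPS trace with activities via Viterbi decoding in a small HMM whose emissions score the distance to two hypothetical significant places; every number is invented.

```python
# Viterbi labeling of a 1-D GPS trace (a much-simplified stand-in for the
# paper's relational Markov network; all numbers are invented).
import numpy as np

acts = ["home", "work", "travel"]
trans = np.log(np.array([[0.8, 0.0, 0.2],
                         [0.0, 0.8, 0.2],
                         [0.1, 0.1, 0.8]]) + 1e-12)   # log transition matrix
places = {"home": 0.0, "work": 10.0}                  # 1-D coordinates of two places

def log_emit(x):
    # being near a place makes that activity likely; otherwise "travel"
    return np.array([-abs(x - places["home"]), -abs(x - places["work"]), -1.5])

trace = [0.1, 0.3, 2.0, 5.0, 8.0, 9.8, 9.9]           # hypothetical GPS positions

# standard Viterbi recursion in log space
V = log_emit(trace[0]) + np.log(1.0 / 3)              # uniform prior over activities
back = []
for x in trace[1:]:
    scores = V[:, None] + trans                       # scores[i, j]: come from i, go to j
    back.append(scores.argmax(axis=0))
    V = scores.max(axis=0) + log_emit(x)

path = [int(V.argmax())]
for b in reversed(back):
    path.append(int(b[path[-1]]))
labels = [acts[i] for i in reversed(path)]
print(list(zip(trace, labels)))                       # early points label as "home", late as "work"
```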

same-paper 2 0.75461239 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

Author: Gabriel Y. Weintraub, Lanier Benkard, Benjamin Van Roy

Abstract: We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. We provide conditions that guarantee our method approximates an equilibrium as the number of agents grow. We then derive a performance bound to assess how well the approximation performs for any given number of agents. We apply our method to an important class of problems in applied microeconomics. We show with numerical experiments that we are able to greatly expand the set of economic problems that can be analyzed computationally. 1

3 0.6814512 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs

Author: Firas Hamze, Nando de Freitas

Abstract: This paper presents a new sampling algorithm for approximating functions of variables representable as undirected graphical models of arbitrary connectivity with pairwise potentials, as well as for estimating the notoriously difficult partition function of the graph. The algorithm fits into the framework of sequential Monte Carlo methods rather than the more widely used MCMC, and relies on constructing a sequence of intermediate distributions which get closer to the desired one. While the idea of using “tempered” proposals is known, we construct a novel sequence of target distributions where, rather than dropping a global temperature parameter, we sequentially couple individual pairs of variables that are, initially, sampled exactly from a spanning tree of the variables. We present experimental results on inference and estimation of the partition function for sparse and densely-connected graphs.
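
The sketch below shows only the generic tempered-SMC / annealed-importance baseline that such methods build on: particles move through a plain temperature ladder to estimate the partition function of a tiny four-spin Ising ring. The paper's actual contribution, sequentially coupling pairs of variables starting from a spanning tree, is not implemented, and the model size, coupling strength, and schedule are our own choices.

```python
# Tempered SMC / annealed importance sampling on a 4-spin Ising ring
# (illustrative assumptions only; this is the plain temperature-ladder
# baseline, not the paper's edge-coupling schedule).
import itertools
import numpy as np

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # 4-spin ring with coupling J
J, n_particles, n_steps = 0.6, 500, 20

def energy(x):
    return -J * sum(x[i] * x[j] for i, j in edges)

betas = np.linspace(0.0, 1.0, n_steps + 1)        # beta = 0: uniform; beta = 1: target
X = rng.choice([-1, 1], size=(n_particles, 4))    # exact samples at beta = 0
log_w = np.zeros(n_particles)

for b0, b1 in zip(betas[:-1], betas[1:]):
    # incremental importance weight for moving the target from b0 to b1
    E = np.array([energy(x) for x in X])
    log_w += -(b1 - b0) * E
    # one Gibbs sweep per particle at the new inverse temperature b1
    for k in range(n_particles):
        for i in range(4):
            x = X[k].copy()
            x[i] = 1
            e_up = energy(x)
            x[i] = -1
            e_dn = energy(x)
            p_up = 1.0 / (1.0 + np.exp(-b1 * (e_dn - e_up)))
            X[k, i] = 1 if rng.random() < p_up else -1

# Z_target / Z_uniform is estimated by the mean weight; Z_uniform = 2**4
logZ_est = np.log(np.mean(np.exp(log_w))) + 4 * np.log(2)
Z_exact = sum(np.exp(-energy(np.array(s))) for s in itertools.product([-1, 1], repeat=4))
print(np.exp(logZ_est), Z_exact)                  # the two values should roughly agree
```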

4 0.48296872 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the number of agents n. We prove a worst-case lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: they require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1
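
A small numerical illustration (ours, not the paper's lower-bound argument) of why a global reward signal scales poorly: estimating one agent's expected reward from the global sum has variance that grows linearly with the number of agents, while the local-reward estimate does not, so matching accuracy requires roughly linearly more samples.

```python
# Variance of a "global signal" estimate of one agent's expected reward
# grows linearly with the number of agents n (all numbers illustrative).
import numpy as np

rng = np.random.default_rng(0)
samples = 2000

for n in [1, 10, 100, 1000]:
    # each agent independently earns reward 1 with probability 0.5
    R = rng.integers(0, 2, size=(samples, n))
    local_est = R[:, 0]                            # unbiased, variance about 0.25
    global_est = R.sum(axis=1) - 0.5 * (n - 1)     # unbiased after removing others' mean
    print(n, local_est.var(), global_est.var())    # global variance grows with n
```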

5 0.47411558 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning.

6 0.46868381 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation

7 0.46354279 78 nips-2005-From Weighted Classification to Policy Search

8 0.46246153 108 nips-2005-Layered Dynamic Textures

9 0.45076853 153 nips-2005-Policy-Gradient Methods for Planning

10 0.44886303 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

11 0.44770378 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks

12 0.4464241 46 nips-2005-Consensus Propagation

13 0.44172791 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

14 0.43531784 144 nips-2005-Off-policy Learning with Options and Recognizers

15 0.42640457 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

16 0.42638633 133 nips-2005-Nested sampling for Potts models

17 0.42511889 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

18 0.42029634 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks

19 0.41828543 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

20 0.41305593 36 nips-2005-Bayesian models of human action understanding