nips nips2005 nips2005-53 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin Zinkevich, Amy Greenwald, Michael L. Littman
Abstract: Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. Instead, we propose an alternative interpretation of the output of value iteration based on a new (non-stationary) equilibrium concept that we call “cyclic equilibria.” We prove that value iteration identifies cyclic equilibria in a class of games in which it fails to find stationary equilibria. We also demonstrate empirically that value iteration finds cyclic equilibria in nearly all examples drawn from a random distribution of Markov games. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. [sent-6, score-0.574]
2 In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. [sent-7, score-0.891]
3 Instead, we propose an alternative interpretation of the output of value iteration based on a new (non-stationary) equilibrium concept that we call “cyclic equilibria. [sent-8, score-0.548]
4 ” We prove that value iteration identifies cyclic equilibria in a class of games in which it fails to find stationary equilibria. [sent-9, score-1.49]
5 We also demonstrate empirically that value iteration finds cyclic equilibria in nearly all examples drawn from a random distribution of Markov games. [sent-10, score-0.834]
6 1 Introduction Value iteration (Bellman, 1957) has proven its worth in a variety of sequential decision-making settings, most significantly single-agent environments (Puterman, 1994), team games, and two-player zero-sum games (Shapley, 1953). [sent-11, score-0.626]
7 In value iteration for Markov decision processes and team Markov games, the value of a state is defined to be the maximum over all actions of the value of the combination of the state and action (or Q value). [sent-12, score-0.467]
8 Learning algorithms based on this update have been shown to compute equilibria in both model-based scenarios (Brafman & Tennenholtz, 2002) and Q-learning-like model-free scenarios (Littman & Szepesvári, 1996). [sent-14, score-0.323]
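For contrast with the general-sum update rules discussed next, here is a minimal sketch of the classic max backup that value iteration uses in single-agent and team settings. This is an illustrative Python sketch; the data-structure names (reward, transition, actions) are assumptions for the example, not from the paper.

    import numpy as np

    def mdp_backup(V, reward, transition, actions, gamma):
        # One Bellman backup: V'(s) = max_a [ R(s, a) + gamma * sum_s' T(s, a)(s') * V(s') ]
        V_new = np.empty_like(V)
        for s in range(len(V)):
            V_new[s] = max(reward[(s, a)] + gamma * transition[(s, a)] @ V
                           for a in actions[s])
        return V_new

The general-sum variants described below replace this max with an equilibrium-selection step over the joint Q values.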
9 Here, value-update rules based on select Nash or correlated equilibria have been evaluated empirically and have been shown to perform reasonably in some settings. [sent-16, score-0.371]
10 None has been identified that computes equilibria in general, however, leaving open the question of whether such an update rule is even possible. [sent-17, score-0.315]
11 Our main negative theoretical result is that an entire class of value-iteration update rules, including all those mentioned above, can be excluded from consideration for computing stationary equilibria in general-sum Markov games. [sent-18, score-0.524]
12 We demonstrate a class of games in which Q values, even those corresponding to an equilibrium policy, contain insufficient information for reconstructing an equilibrium policy. [sent-20, score-1.162]
13 Faced with the impossibility of developing algorithms along the lines of traditional value iteration that find stationary equilibria, we suggest an alternative equilibrium concept: cyclic equilibria. [sent-21, score-1.16]
14 A cyclic equilibrium is a kind of non-stationary joint policy that satisfies the standard conditions for equilibria (no incentive to deviate unilaterally). [sent-22, score-1.141]
15 However, unlike conditional non-stationary policies such as tit-for-tat and finite-state strategies based on the “folk theorem” (Osborne & Rubinstein, 1994), cyclic equilibria cycle rigidly through a set of stationary policies. [sent-23, score-1.0]
16 First, we consider the class of two-player two-state two-action games used to show that Q values cannot reconstruct all stationary equilibria. [sent-25, score-0.671]
17 1 shows that value iteration finds cyclic equilibria for all games in this class. [sent-27, score-1.257]
18 In contrast, value iteration finds cyclic equilibria for nearly all the games. [sent-30, score-0.834]
19 The success of value iteration in finding cyclic equilibria suggests this generalized solution concept could be useful for constructing robust multiagent learning algorithms. [sent-31, score-0.88]
20 2 An Impossibility Result for Q Values In this section, we consider a subclass of Markov games in which transitions are deterministic and are controlled by one player at a time. [sent-32, score-0.822]
21 We show that this class includes games that have no deterministic equilibrium policies. [sent-33, score-0.902]
22 The first, a negative result, states that the Q values used in existing value-iteration algorithms are insufficient for deriving equilibrium policies. [sent-35, score-0.454]
23 1, is a positive result that states that value iteration does converge to cyclic equilibrium policies in this class of games. [sent-37, score-1.109]
24 Definition 1 A Markov game Γ = [S, N, A, T, R, γ] is a set of states S, a set of players N = {1, . [sent-40, score-0.378]
25 …, n}, a set of actions for each player in each state {A_{i,s}}_{s∈S, i∈N} (where we represent the set of all state-action pairs as A ≡ ∪_{s∈S} {s} × ∏_{i∈N} A_{i,s}), a transition function T : A → ∆(S), a reward function R : A → R^n, and a discount factor γ. [sent-43, score-0.417]
26 A stationary policy is a set of distributions {π(s) : s ∈ S}, where for all s ∈ S, π(s) ∈ ∆ (As ). [sent-45, score-0.339]
27 Given a stationary policy π, define V^{π,Γ} : S → R^n and Q^{π,Γ} : A → R^n to be the unique pair of functions satisfying the following system of equations: for all i ∈ N and all (s, a) ∈ A, V_i^{π,Γ}(s) = Σ_{a ∈ A_s} π(s)(a) Q_i^{π,Γ}(s, a)   (1)   and   Q_i^{π,Γ}(s, a) = R_i(s, a) + γ Σ_{s′ ∈ S} T(s, a)(s′) V_i^{π,Γ}(s′)   (2). [sent-46, score-0.364]
28 A deterministic Markov game is a Markov game Γ where the transition function is deterministic: T : A → S. [sent-47, score-0.593]
29 A turn-taking game is a Markov game Γ where in every state, only one player has a choice of action. [sent-48, score-0.813]
30 Formally, for all s ∈ S, there exists a player i ∈ N such that for all other players j ∈ N \{i}, |Aj,s | = 1. [sent-49, score-0.355]
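A sketch of how equations (1) and (2) can be evaluated numerically for a small finite Markov game, by iterating the pair of equations to their fixed point. This is an illustrative Python sketch under assumed data structures (joint_actions, reward, transition, policy); it is not code from the paper.

    import numpy as np

    def evaluate_stationary_policy(n_states, n_players, joint_actions, reward,
                                   transition, policy, gamma, iters=1000):
        # reward[(s, a)]    : length-n_players vector R(s, a)
        # transition[(s, a)]: length-n_states probability vector T(s, a)
        # policy[s]         : dict mapping joint action a -> probability pi(s)(a)
        V = np.zeros((n_states, n_players))
        Q = {}
        for _ in range(iters):
            for s in range(n_states):
                for a in joint_actions[s]:
                    # Equation (2): Q_i(s, a) = R_i(s, a) + gamma * sum_s' T(s, a)(s') V_i(s')
                    Q[(s, a)] = reward[(s, a)] + gamma * transition[(s, a)] @ V
            V_new = np.zeros((n_states, n_players))
            for s in range(n_states):
                for a in joint_actions[s]:
                    # Equation (1): V_i(s) = sum_a pi(s)(a) Q_i(s, a)
                    V_new[s] += policy[s].get(a, 0.0) * Q[(s, a)]
            V = V_new
        return V, Q

Because γ < 1, the iteration is a contraction, so V and Q converge to the unique solution of (1) and (2).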
31 2.2 A Negative Result for Stationary Equilibria A NoSDE (pronounced “nasty”) game is a deterministic turn-taking Markov game Γ with two players, two states, no more than two actions for either player in either state, and no deterministic stationary equilibrium policy. [sent-51, score-1.629]
32 That the set of NoSDE games is non-empty is demonstrated by the game depicted in Figure 1. [sent-52, score-0.688]
33 This game has no deterministic stationary equilibrium policy: If Player 1 sends, Player 2 prefers to send; but, if Player 2 sends, Player 1 prefers to keep; and, if Player 1 keeps, Player 2 prefers to keep; but, if Player 2 keeps, Player 1 prefers to send. [sent-53, score-1.083]
34 No deterministic policy is an equilibrium because one player will always have an incentive to change policies. [sent-54, score-0.884]
35 In the unique stationary equilibrium, Player 1 sends with probability 2/3 and Player 2 sends with probability 5/12. [sent-57, score-0.326]
36 Lemma 1 Every NoSDE game has a unique stationary equilibrium policy. [sent-58, score-0.843]
37 This fact can be demonstrated simply by a game with one state where the utilities correspond to a bimatrix game with no deterministic equilibria (penny matching, say). [sent-60, score-0.887]
38 Random actions in these games are sometimes linked with strategies that use “faking” or “bluffing” to avoid being exploited. [sent-61, score-0.522]
39 That NoSDE games exist is surprising, in that randomness is needed even though actions are always taken with complete information about the other player’s choice and the state of the game. [sent-62, score-0.545]
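As a concrete reading of the NoSDE definition, the brute-force check below enumerates every deterministic stationary policy of a small turn-taking game and tests whether any of them is an equilibrium, i.e., whether the controlling player ever gains from a one-shot deviation. It reuses the hypothetical evaluate_stationary_policy sketch above; game.controller(s) and the other attribute names are assumptions for illustration.

    from itertools import product

    def has_deterministic_equilibrium(game, gamma, tol=1e-6):
        choices = [game.joint_actions[s] for s in range(game.n_states)]
        for joint_choice in product(*choices):
            policy = {s: {joint_choice[s]: 1.0} for s in range(game.n_states)}
            V, Q = evaluate_stationary_policy(game.n_states, game.n_players,
                                              game.joint_actions, game.reward,
                                              game.transition, policy, gamma)
            stable = True
            for s in range(game.n_states):
                i = game.controller(s)   # the one player who moves in state s
                played = Q[(s, joint_choice[s])][i]
                best = max(Q[(s, a)][i] for a in game.joint_actions[s])
                if best > played + tol:  # a profitable one-shot deviation exists
                    stable = False
                    break
            if stable:
                return True
        return False   # a NoSDE game makes this return False

With discounting, the standard policy-improvement argument makes the one-shot deviation check against Q^{π,Γ} sufficient: if no single-state deviation helps the mover, no deviation does.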
40 Current value-iteration algorithms attempt to find the Q values of a game with the goal of using these values to find a stationary equilibrium of the game. [sent-64, score-0.818]
41 The main theorem of this paper states that it is not possible to derive a policy from the Q values for NoSDE games, and therefore in general Markov games. [sent-65, score-0.206]
42 Theorem 1 For any NoSDE game Γ = [S, N, A, T, R] with a unique equilibrium policy π, there exists another NoSDE game Γ′ = [S, N, A, T, R′] with its own unique equilibrium policy π′ such that Q^{π,Γ} = Q^{π′,Γ′} but π ≠ π′ and V^{π,Γ} ≠ V^{π′,Γ′}. [sent-66, score-1.504]
43 This result establishes that computing or learning Q values is insufficient to compute a stationary equilibrium of a game. [sent-67, score-0.568]
44 In this paper we suggest an alternative, where we still do value iteration in the same way, but we extract a cyclic equilibrium from the sequence of values instead of a stationary one. [sent-68, score-0.536]
45 (Footnote 1: The policy is both a correlated equilibrium and a Nash equilibrium.) [sent-71, score-1.116]
46 3 A New Goal: Cyclic Equilibria A cyclic policy is a finite sequence of stationary policies π = {π_1, …, π_k}. [sent-72, score-0.805]
47 Definition 2 Given a Markov game Γ, a cyclic correlated equilibrium is a cyclic policy π, where for all j ∈ {1, . [sent-80, score-1.526]
48 A similar definition can be constructed for Nash equilibria by insisting that all policies πj (s) are product distributions. [sent-85, score-0.367]
49 At each stage, the policy is executed as in a typical correlated equilibrium: the referee chooses a joint action a from π_j(s), tells each agent its part of that joint action, and no agent can improve its value by eschewing the referee’s advice. [sent-87, score-0.632]
50 If no agent can improve its value by more than ε at any stage, we say π is an ε-correlated cyclic equilibrium. [sent-88, score-0.46]
51 A stationary correlated equilibrium is a cyclic correlated equilibrium with k = 1. [sent-89, score-1.409]
52 In the next section, we show how value iteration can be used to derive cyclic correlated equilibria. [sent-90, score-0.607]
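To make Definition 2 concrete, the value of a cyclic policy {π_1, …, π_k} at stage j obeys a Bellman-style equation that chains stage j to stage j+1 (mod k). The sketch below computes these stage values and then measures the largest one-shot gain any mover could obtain by ignoring the referee; that gain is the ε of an ε-correlated cyclic equilibrium. It uses the same assumed data structures as the earlier sketches, and for simplicity the deviation check is written for turn-taking games, where only the controlling player can deviate.

    import numpy as np

    def cyclic_stage_values(game, cycle, gamma, iters=2000):
        # cycle[j][s] is a dict: joint action a -> probability under the stage-j policy
        k = len(cycle)
        V = [np.zeros((game.n_states, game.n_players)) for _ in range(k)]
        Q = [dict() for _ in range(k)]
        for _ in range(iters):
            newV = [np.zeros((game.n_states, game.n_players)) for _ in range(k)]
            for j in range(k):
                nxt = (j + 1) % k   # stage j is followed by stage j+1 (mod k)
                for s in range(game.n_states):
                    for a in game.joint_actions[s]:
                        Q[j][(s, a)] = game.reward[(s, a)] + gamma * game.transition[(s, a)] @ V[nxt]
                        newV[j][s] += cycle[j][s].get(a, 0.0) * Q[j][(s, a)]
            V = newV
        return V, Q

    def max_referee_deviation(game, cycle, Q):
        # Largest one-shot gain from ignoring the referee's advice (turn-taking case).
        eps = 0.0
        for j in range(len(cycle)):
            for s in range(game.n_states):
                i = game.controller(s)
                for a, p in cycle[j][s].items():
                    if p > 0.0:
                        best = max(Q[j][(s, b)][i] for b in game.joint_actions[s])
                        eps = max(eps, best - Q[j][(s, a)][i])
        return eps   # cycle is an eps-correlated cyclic equilibrium

A stationary correlated equilibrium is recovered as the special case k = 1.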
53 4 Value Iteration in General-Sum Markov Games For a game Γ, define Q_Γ = (R^n)^A to be the set of all state-action (Q) value functions, V_Γ = (R^n)^S to be the set of all value functions, and Π_Γ to be the set of all stationary policies. [sent-91, score-0.559]
54 Traditionally, value iteration can be broken down into estimating a Q value based upon a value function, selecting a policy π given the Q values, and deriving a value function based upon π and the Q value functions. [sent-92, score-0.53]
55 Q and π agree if, for every player i and state s, the expected payoff under π, Σ_{(a_i, a_{−i}) ∈ A_s} π(s)(a_i, a_{−i}) Q_i(s, a_i, a_{−i}), is at least the expected payoff of any unilateral deviation, the deviation payoff being a sum over a_{−i} ∈ A_{−i,s} (6); essentially, Q and π agree if π is a best response for each player given payoffs Q. [sent-95, score-0.384]
56 In essence, these rules update values assuming an equilibrium policy for a one-stage game with Q(s, a) providing the terminal rewards. [sent-98, score-0.788]
57 (Utilitarian-CE, which we return to later, selects the correlated equilibrium in which the total of the payoffs is maximized. [sent-100, score-0.455]
58 ) Foe-VI and Friend-VI (Littman, 2001) do not fit into our formalism, but it can be proven that in NoSDE games they converge to deterministic policies that are neither stationary nor cyclic equilibria. [sent-101, score-1.262]
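A sketch of one backup of the generalized value iteration described in this section: estimate Q from the current value function, select a one-stage equilibrium π(s) from the payoffs Q(s, ·) with a pluggable equilibrium-selection rule, and back π up into a new value function. In turn-taking games the Nash- and correlated-equilibrium rules named above all reduce to letting the controlling player maximize its own Q value, which is the simple rule sketched here; a utilitarian-CE rule would instead solve a small linear program over distributions on A_s. Names remain illustrative assumptions.

    import numpy as np

    def value_iteration_step(game, V, select, gamma):
        # One generalized backup: V -> (Q, pi, V').
        Q, pi, V_new = {}, {}, np.zeros_like(V)
        for s in range(game.n_states):
            for a in game.joint_actions[s]:
                Q[(s, a)] = game.reward[(s, a)] + gamma * game.transition[(s, a)] @ V
            pi[s] = select(game, s, Q)   # equilibrium of the one-stage game with payoffs Q(s, .)
            for a, p in pi[s].items():
                V_new[s] += p * Q[(s, a)]
        return Q, pi, V_new

    def controlling_player_max(game, s, Q):
        # Selection rule for turn-taking games: the mover picks its own best action.
        i = game.controller(s)
        best = max(game.joint_actions[s], key=lambda a: Q[(s, a)][i])
        return {best: 1.0}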
59 Equation (7) defines the distance between value functions as d(V, V′) = max over s ∈ S, i ∈ N of |V_i(s) − V′_i(s)|. Using our notation, the value-iteration algorithm for general-sum Markov games can be described as follows. [sent-103, score-0.438]
60 If a stationary equilibrium is sought, the final policy is returned. [sent-120, score-0.692]
61 For cyclic equilibria, we have a variety of options for how many past stationary policies we want to consider for forming a cycle. [sent-134, score-0.681]
62 Observe that the order of the policies returned by value iteration is reversed to form a cyclic equilibrium. [sent-137, score-0.67]
63 [… an εγ/(1−γ)-correlated …] Fact 2 If GetCycle returns a cyclic policy of length k and d(V^T, V^{T−k}) = ε, then GetCycle returns an εγ/(1−γ^k)-correlated cyclic equilibrium. [sent-163, score-0.92]
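A sketch of the GetCycle idea as described around Fact 2: run T generalized backups, find a recent cycle length k at which the value functions approximately repeat, and return the last k selected policies in reverse order; Fact 2 then bounds the resulting ε. This reuses the hypothetical value_iteration_step sketch above; the paper's exact pseudocode is not reproduced in this extracted text, so the control flow here is an assumption (and it assumes max_cycle ≤ T).

    def get_cycle(game, V0, select, gamma, T, max_cycle):
        V, policies, values = V0, [], [V0]
        for _ in range(T):
            _, pi, V = value_iteration_step(game, V, select, gamma)
            policies.append(pi)
            values.append(V)

        def d(U, W):   # distance between value functions
            return float(abs(U - W).max())

        k = min(range(1, max_cycle + 1), key=lambda j: d(values[T], values[T - j]))
        eps = d(values[T], values[T - k])
        # Fact 2: the reversed last k policies form an
        # (eps * gamma / (1 - gamma**k))-correlated cyclic equilibrium.
        return list(reversed(policies[-k:])), eps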
64 Therefore, the two practical (and open) questions are (1) how many iterations does it take to find an ε-correlated cyclic equilibrium? [sent-168, score-0.37]
65 and (2) how large is the cyclic equilibrium that is found? [sent-169, score-0.723]
66 Theorem 2 Given the utilitarian-CE equilibrium-selection rule f, for every NoSDE game Γ, for every V^0 ∈ V_Γ, there exists some finite T such that GetCycle(Γ, V^0, f, T, ⌈T/2⌉) returns a cyclic correlated equilibrium. [sent-172, score-0.775]
67 Theorem 3 Given the utilitarian-CE equilibrium-selection rule f, for any NoSDE game Γ with unique equilibrium π, for every V^0 ∈ V_Γ, the value-function sequence {V^1, V^2, … [sent-175, score-0.674]
68 Theorem 4 Given the game Γ in Figure 1 and its stationary equilibrium π, given V_i^0(s) = 0 for all i ∈ N, s ∈ S, then for any update rule f ∈ F_Γ, the value-function sequence {V^1, V^2, … [sent-180, score-0.862]
69 5 Empirical Results To complement the formal results of the previous sections, we ran two batteries of tests on value iteration in randomly generated games. [sent-184, score-0.194]
70 We assessed the convergence behavior of value iteration to stationary and cyclic equilibria. [sent-185, score-0.788]
71 5.1 Experimental Details Our game generator took as input the set of players N, the set of states S, and for each player i and state s, the actions A_{i,s}. [sent-187, score-0.795]
72 The primary dependent variable in our results was the frequency with which value iteration converged to a stationary Nash equilibrium or a cyclic Nash equilibrium (of length less than 100). [sent-192, score-1.555]
73 To determine convergence, we first ran value iteration for 1000 steps. [sent-193, score-0.194]
74 If successive value functions differed by less than 0.0001, then we considered value iteration to have converged to a stationary policy. [sent-195, score-0.479]
75 If the cycle criterion of Equation (8) held with threshold 0.0001, then we considered value iteration to have converged to a cycle. [sent-200, score-0.264]
76 To determine if a game has a deterministic equilibrium, for every deterministic policy π, we ran policy evaluation (for 1000 iterations) to estimate V^{π,Γ} and Q^{π,Γ}, and then checked whether π was an ε-correlated equilibrium for ε = 0.0001. [sent-201, score-1.075]
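A sketch of the convergence tests described in this paragraph: after running value iteration for many backups, declare convergence to a stationary policy when successive value functions change by less than 0.0001, and declare convergence to a cycle of length k when the value functions repeat across an entire cycle (the footnoted criterion attributed to Equation 8). Equation 8 itself is not visible in this extracted text, so the exact check below, which compares every offset within the last cycle, is an assumption; it also assumes the run is long relative to max_cycle.

    def detect_convergence(values, max_cycle=100, tol=1e-4):
        # values[t] is the (numpy) value function after backup t.
        def d(U, W):
            return float(abs(U - W).max())

        T = len(values) - 1
        if d(values[T], values[T - 1]) < tol:
            return "stationary", 1
        for k in range(2, max_cycle + 1):
            # Check the entire last cycle, not just its endpoints (cf. footnote 3).
            if all(d(values[T - j], values[T - j - k]) < tol for j in range(k)):
                return "cyclic", k
        return "no convergence detected", None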
77 5.2 Turn-taking Games In the first battery of tests, we considered sets of turn-taking games with x states and y actions: formally, there were x states {1, … [sent-204, score-0.566]
78 In odd-numbered states, Player 1 had y actions. (Footnote 3: In contrast to the GetCycle algorithm, we are here concerned with finding a cyclic equilibrium, so we check an entire cycle for convergence.) [sent-208, score-0.771]
79 Figure 2 caption (Left): The graph plots the number of games where value iteration did not converge to a stationary equilibrium. [sent-210, score-0.861]
80 (Right) Frequency of convergence on 100 randomly generated games with simultaneous actions. [sent-211, score-0.463]
81 Cyclic uCE is the number of times utilitarian-CE converged to a cyclic equilibrium. [sent-212, score-0.456]
82 OTComb is the number of games where any one of Friend-VI, Foe-VI, utilitarian-NE-VI, and 5 variants of correlated-equilibrium-VI (dictatorial-CE-VI (First Player), dictatorial-CE-VI (Second Player), utilitarian-CE-VI, plutocratic-CE-VI, and egalitarian-VI) converged to a stationary equilibrium. [sent-213, score-0.831]
83 OTBest is the maximum number of games where the best fixed choice of the equilibrium-selection rule converged. [sent-214, score-0.462]
84 uCE is the number of games in which utilitarian-CE-VI converged to a stationary equilibrium. [sent-215, score-0.739]
85 Figure 2 (left) shows the number of generated games for which value iteration did not converge to a stationary equilibrium. [sent-219, score-0.861]
86 We found that nearly half (48%, as many as 5% of the total set) of these non-converged games had no stationary, deterministic equilibria (they were NoSDE games). [sent-220, score-0.817]
87 The remainder of the stationary, deterministic equilibria were simply not discovered by value iteration. [sent-221, score-0.411]
88 We also found that value iteration converged to cycles of length 100 or less in over 99% of the games. [sent-222, score-0.264]
89 5.3 Simultaneous Games In a second set of experiments, we generated two-player Markov games where both agents have at least two actions in every state. [sent-225, score-0.544]
90 We varied the number of states between 2 and 9, and had either 2 or 3 actions for every agent in every state. [sent-226, score-0.235]
91 Figure 2 (right) summarizes results for 3-action games (2-action games were qualitatively similar, but converged more often). [sent-227, score-0.962]
92 Note that the fraction of random games on which the algorithms converged to stationary equilibria decreases as the number of states increases. [sent-228, score-1.074]
93 This result holds because the larger the game, the larger the chance that value iteration will fall into a cycle on some subset of the states. [sent-229, score-0.226]
94 Once again, we see that the cyclic equilibria are found much more reliably than stationary equilibria by value-iteration algorithms. [sent-230, score-1.127]
95 For example, utilitarian-CE converges to a cyclic correlated equilibrium about 99% of the time, whereas with 10 states and 3 actions, on 26% of the games none of the techniques converge. [sent-231, score-1.303]
96 In fact, there are an infinite number of such games with different Nash equilibrium value functions that have identical Q-value functions. [sent-234, score-0.838]
97 This result holds for proposed variants of value iteration from the literature such as updating via a correlated equilibrium or a Nash equilibrium, since, in turn-taking Markov games, both rules reduce to updating via the action with the maximum value for the controlling player. [sent-235, score-0.754]
98 Our results paint a bleak picture for the use of value-iteration-based algorithms for computing stationary equilibria. [sent-236, score-0.215]
99 However, in a class of games we called NoSDE games, a natural extension of value iteration converges to a limit cycle, which is in fact a cyclic (nonstationary) Nash equilibrium policy. [sent-237, score-1.357]
100 Such cyclic equilibria can also be found reliably for randomly generated games and there is evidence that they appear in some naturally occurring problems (Tesauro & Kephart, 1999). [sent-238, score-1.079]
wordName wordTfidf (topN-words)
[('games', 0.438), ('cyclic', 0.37), ('equilibrium', 0.353), ('player', 0.291), ('equilibria', 0.271), ('game', 0.25), ('stationary', 0.215), ('noop', 0.191), ('nosde', 0.191), ('nash', 0.132), ('iteration', 0.131), ('policy', 0.124), ('policies', 0.096), ('deterministic', 0.093), ('converged', 0.086), ('actions', 0.084), ('markov', 0.075), ('valueiteration', 0.073), ('send', 0.069), ('states', 0.064), ('players', 0.064), ('getcycle', 0.059), ('greenwald', 0.059), ('uce', 0.059), ('correlated', 0.059), ('keep', 0.055), ('qt', 0.05), ('littman', 0.049), ('cycle', 0.048), ('ai', 0.047), ('value', 0.047), ('getstrategy', 0.044), ('impossibility', 0.044), ('maxcycle', 0.044), ('multiagent', 0.044), ('referee', 0.044), ('sends', 0.043), ('prefers', 0.043), ('agent', 0.043), ('action', 0.043), ('rules', 0.041), ('vi', 0.038), ('ri', 0.035), ('variants', 0.033), ('converge', 0.03), ('integer', 0.03), ('brafman', 0.029), ('inck', 0.029), ('kephart', 0.029), ('osborne', 0.029), ('otbest', 0.029), ('otcomb', 0.029), ('puterman', 0.029), ('rubinstein', 0.029), ('shapley', 0.029), ('szepesv', 0.029), ('tennenholtz', 0.029), ('wellman', 0.029), ('returns', 0.028), ('returned', 0.026), ('hu', 0.026), ('payoffs', 0.026), ('unique', 0.025), ('insuf', 0.025), ('convergence', 0.025), ('rn', 0.024), ('rule', 0.024), ('state', 0.023), ('incentive', 0.023), ('tesauro', 0.023), ('every', 0.022), ('ingenuity', 0.022), ('nonstationary', 0.022), ('team', 0.022), ('deriving', 0.021), ('nj', 0.021), ('agree', 0.02), ('bellman', 0.02), ('proven', 0.02), ('update', 0.02), ('alberta', 0.019), ('discount', 0.019), ('none', 0.019), ('generator', 0.019), ('broken', 0.019), ('hall', 0.018), ('theorem', 0.018), ('class', 0.018), ('return', 0.017), ('proceedings', 0.017), ('concept', 0.017), ('nds', 0.017), ('existing', 0.016), ('scenarios', 0.016), ('keeps', 0.016), ('planning', 0.016), ('ran', 0.016), ('environments', 0.015), ('nearly', 0.015), ('stage', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 53 nips-2005-Cyclic Equilibria in Markov Games
Author: Martin Zinkevich, Amy Greenwald, Michael L. Littman
Abstract: Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. Instead, we propose an alternative interpretation of the output of value iteration based on a new (non-stationary) equilibrium concept that we call “cyclic equilibria.” We prove that value iteration identifies cyclic equilibria in a class of games in which it fails to find stationary equilibria. We also demonstrate empirically that value iteration finds cyclic equilibria in nearly all examples drawn from a random distribution of Markov games. 1
2 0.25105038 111 nips-2005-Learning Influence among Interacting Markov Chains
Author: Dong Zhang, Daniel Gatica-perez, Samy Bengio, Deb Roy
Abstract: We present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a two-level structure: individual-level and group-level. Individual level models actions of each player, and the group-level models actions of the team as a whole. Experiments on synthetic multi-player games and a multi-party meeting corpus show the effectiveness of the proposed model. 1
3 0.17542982 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
Author: Gabriel Y. Weintraub, Lanier Benkard, Benjamin Van Roy
Abstract: We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. We provide conditions that guarantee our method approximates an equilibrium as the number of agents grow. We then derive a performance bound to assess how well the approximation performs for any given number of agents. We apply our method to an important class of problems in applied microeconomics. We show with numerical experiments that we are able to greatly expand the set of economic problems that can be analyzed computationally. 1
4 0.13751455 78 nips-2005-From Weighted Classification to Policy Search
Author: Doron Blatt, Alfred O. Hero
Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1
5 0.10467299 144 nips-2005-Off-policy Learning with Options and Recognizers
Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh
Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. 
In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. 
The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . 
Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. 
A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. 
(n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. 
Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ : S → [0, 1]. The structure used to compute µ will have to be compatible ˆ ˆ with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ will have one ˆ entry for every partition p of the state space. Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ converges to ˆ ∑s d b (s|p) ∑a c(p, a)b(s, a) where d b (s|p) is the probability of visiting state s from partiˆ ˆ tion p under the stationary distribution of b. At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i.e., ∑a π(s, a) = 1). Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. Note though that the learning algorithm never uses π; the only quantities needed ˆ are ρ, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work. Acknowledgements The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML. Sutton, R.S., Precup D. and Singh, S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol . 112, pp. 181–211. Sutton,, R.S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17. Sutton R.S., Raffols E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks”. In Proceedings of NIPS-18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. In Machine learning vol. 42, pp. 241-267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:674–690.
6 0.091541909 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models
7 0.090028293 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning
8 0.089476936 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
9 0.082910739 153 nips-2005-Policy-Gradient Methods for Planning
10 0.061189666 22 nips-2005-An Analog Visual Pre-Processing Processor Employing Cyclic Line Access in Only-Nearest-Neighbor-Interconnects Architecture
11 0.055225939 187 nips-2005-Temporal Abstraction in Temporal-difference Networks
12 0.0524684 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
13 0.051226042 36 nips-2005-Bayesian models of human action understanding
14 0.051054154 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
15 0.042909089 148 nips-2005-Online Discovery and Learning of Predictive State Representations
16 0.040590215 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
17 0.039189719 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
18 0.038919464 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
19 0.033818781 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine
20 0.030980842 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
topicId topicWeight
[(0, 0.104), (1, -0.012), (2, 0.255), (3, -0.089), (4, -0.033), (5, -0.009), (6, -0.006), (7, 0.031), (8, -0.006), (9, -0.071), (10, -0.071), (11, -0.005), (12, 0.02), (13, 0.013), (14, -0.034), (15, -0.016), (16, 0.009), (17, 0.077), (18, 0.016), (19, 0.063), (20, -0.013), (21, -0.162), (22, -0.028), (23, 0.102), (24, -0.107), (25, -0.205), (26, 0.125), (27, 0.144), (28, -0.095), (29, 0.007), (30, -0.233), (31, -0.169), (32, 0.007), (33, -0.061), (34, 0.341), (35, -0.133), (36, -0.046), (37, -0.084), (38, 0.016), (39, -0.057), (40, 0.061), (41, 0.074), (42, -0.014), (43, 0.08), (44, 0.014), (45, -0.029), (46, -0.037), (47, -0.006), (48, 0.108), (49, -0.1)]
simIndex simValue paperId paperTitle
same-paper 1 0.97923881 53 nips-2005-Cyclic Equilibria in Markov Games
Author: Martin Zinkevich, Amy Greenwald, Michael L. Littman
Abstract: Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. Instead, we propose an alternative interpretation of the output of value iteration based on a new (non-stationary) equilibrium concept that we call “cyclic equilibria.” We prove that value iteration identifies cyclic equilibria in a class of games in which it fails to find stationary equilibria. We also demonstrate empirically that value iteration finds cyclic equilibria in nearly all examples drawn from a random distribution of Markov games. 1
2 0.84925634 111 nips-2005-Learning Influence among Interacting Markov Chains
Author: Dong Zhang, Daniel Gatica-perez, Samy Bengio, Deb Roy
Abstract: We present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a two-level structure: individual-level and group-level. Individual level models actions of each player, and the group-level models actions of the team as a whole. Experiments on synthetic multi-player games and a multi-party meeting corpus show the effectiveness of the proposed model. 1
3 0.54074603 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
Author: Gabriel Y. Weintraub, Lanier Benkard, Benjamin Van Roy
Abstract: We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. We provide conditions that guarantee our method approximates an equilibrium as the number of agents grow. We then derive a performance bound to assess how well the approximation performs for any given number of agents. We apply our method to an important class of problems in applied microeconomics. We show with numerical experiments that we are able to greatly expand the set of economic problems that can be analyzed computationally. 1
4 0.3947593 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning
Author: Drew Bagnell, Andrew Y. Ng
Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1
5 0.37201715 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
Author: Christopher Williams, John Quinn, Neil Mcintosh
Abstract: The observed physiological dynamics of an infant receiving intensive care are affected by many possible factors, including interventions to the baby, the operation of the monitoring equipment and the state of health. The Factorial Switching Kalman Filter can be used to infer the presence of such factors from a sequence of observations, and to estimate the true values where these observations have been corrupted. We apply this model to clinical time series data and show it to be effective in identifying a number of artifactual and physiological patterns. 1
6 0.32834926 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models
7 0.27201355 78 nips-2005-From Weighted Classification to Policy Search
8 0.25769579 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
9 0.25111085 153 nips-2005-Policy-Gradient Methods for Planning
10 0.25030813 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters
11 0.23469208 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
12 0.22626664 62 nips-2005-Efficient Estimation of OOMs
14 0.20244364 144 nips-2005-Off-policy Learning with Options and Recognizers
15 0.19084896 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression
16 0.18842533 162 nips-2005-Rate Distortion Codes in Sensor Networks: A System-level Analysis
17 0.18423802 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
18 0.17589864 36 nips-2005-Bayesian models of human action understanding
19 0.17183825 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
20 0.16855592 108 nips-2005-Layered Dynamic Textures
topicId topicWeight
[(3, 0.039), (10, 0.081), (27, 0.044), (29, 0.413), (31, 0.068), (34, 0.057), (55, 0.033), (56, 0.012), (69, 0.036), (73, 0.016), (88, 0.04), (91, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.83024126 53 nips-2005-Cyclic Equilibria in Markov Games
Author: Martin Zinkevich, Amy Greenwald, Michael L. Littman
Abstract: Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. Instead, we propose an alternative interpretation of the output of value iteration based on a new (non-stationary) equilibrium concept that we call “cyclic equilibria.” We prove that value iteration identifies cyclic equilibria in a class of games in which it fails to find stationary equilibria. We also demonstrate empirically that value iteration finds cyclic equilibria in nearly all examples drawn from a random distribution of Markov games. 1
2 0.49115625 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
Author: O. P. Kreidl, Alan S. Willsky
Abstract: Given a directed graphical model with binary-valued hidden nodes and real-valued noisy observations, consider deciding upon the maximum a-posteriori (MAP) or the maximum posterior-marginal (MPM) assignment under the restriction that each node broadcasts only to its children exactly one single-bit message. We present a variational formulation, viewing the processing rules local to all nodes as degrees-of-freedom, that minimizes the loss in expected (MAP or MPM) performance subject to such online communication constraints. The approach leads to a novel message-passing algorithm to be executed offline, or before observations are realized, which mitigates the performance loss by iteratively coupling all rules in a manner implicitly driven by global statistics. We also provide (i) illustrative examples, (ii) assumptions that guarantee convergence and efficiency and (iii) connections to active research areas. 1
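To make the single-bit communication constraint concrete, here is a toy sketch that is explicitly not the paper's variational algorithm: on a chain of binary hidden nodes with Gaussian observations, each node fuses its own measurement with the one bit received from its parent and forwards a thresholded bit to its child. The transition probability, noise level, and threshold rule are invented; the paper's contribution is to optimize such local rules jointly offline rather than use this myopic heuristic.

```python
# Toy illustration (not the paper's algorithm): myopic single-bit message
# passing down a chain of binary hidden nodes with Gaussian observations.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, p_stay, sigma = 10, 0.9, 1.0
x = [int(rng.integers(2))]
for _ in range(n - 1):                          # sample a hidden Markov chain
    x.append(x[-1] if rng.random() < p_stay else 1 - x[-1])
y = [xi + sigma * rng.standard_normal() for xi in x]    # noisy observations

bits, prior = [], np.array([0.5, 0.5])
for t in range(n):
    like = np.array([norm.pdf(y[t], 0, sigma), norm.pdf(y[t], 1, sigma)])
    post = prior * like
    post /= post.sum()
    b = int(post[1] > 0.5)                      # the single bit sent to the child
    bits.append(b)
    # child's prior: condition on the parent's hard bit only
    prior = np.array([p_stay, 1 - p_stay]) if b == 0 else np.array([1 - p_stay, p_stay])

print("hidden:", x)
print("bits  :", bits)
```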
3 0.3124502 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
Author: Benjamin V. Roy
Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective. 1
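The aggregation scheme described in this abstract can be sketched in a few lines. The example below uses an invented random MDP and uniform projection weights purely for illustration; the abstract's point is that projection weights equal to the invariant distribution of the resulting greedy policy (the TD(0) fixed point) yield stronger performance-loss guarantees than an arbitrary choice such as this one.

```python
# Minimal sketch (with an invented random MDP) of approximate value iteration
# under state aggregation: each partition's value is a single constant, and the
# Bellman backup is projected back onto the partitions using weights pi.
import numpy as np

rng = np.random.default_rng(2)
S, A, K, alpha = 12, 3, 4, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[x, a] is a distribution over next states
g = rng.uniform(0, 1, size=(S, A))                  # one-step costs
part = np.arange(S) % K                             # which partition each state belongs to
pi = np.ones(S) / S                                 # projection weights (uniform here)

r = np.zeros(K)                                     # one constant per partition
for _ in range(500):
    J = r[part]                                     # current piecewise-constant approximation
    TJ = (g + alpha * P @ J).min(axis=1)            # Bellman backup at every state
    for k in range(K):                              # weighted projection onto partition k
        mask = part == k
        r[k] = np.average(TJ[mask], weights=pi[mask])

greedy = (g + alpha * P @ r[part]).argmin(axis=1)   # policy greedy w.r.t. the approximation
print("aggregated values:", np.round(r, 3))
print("greedy policy    :", greedy)
```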
4 0.30803719 110 nips-2005-Learning Depth from Single Monocular Images
Author: Ashutosh Saxena, Sung H. Chung, Andrew Y. Ng
Abstract: We consider the task of depth estimation from a single monocular image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured outdoor environments which include forests, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a discriminatively-trained Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models both depths at individual points as well as the relation between depths at different points. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. 1
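Although the paper's MRF uses multiscale local and global image features, the basic trade-off between per-patch depth predictions and neighbourhood smoothness can be illustrated with a much smaller Gaussian MRF over a single row of patches; the features, weights, and smoothness strength below are invented placeholders, and the closed-form sparse solve stands in for the paper's discriminative training and inference.

```python
# Simplified sketch (not the paper's full model): a Gaussian MRF over a 1-D row
# of patches, combining per-patch depth predictions from local features with a
# pairwise smoothness term, solved as a sparse linear system.
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(3)
n_patches, n_feats, lam = 50, 8, 5.0
X = rng.standard_normal((n_patches, n_feats))       # local image features per patch
w = rng.standard_normal(n_feats)                    # "trained" regression weights (placeholder)
d_local = X @ w                                     # unary depth predictions

# minimize sum_i (d_i - d_local_i)^2 + lam * sum_i (d_i - d_{i+1})^2
L = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n_patches, n_patches)).tolil()
L[0, 0] = 1.0                                       # boundary patches have one neighbour
L[n_patches - 1, n_patches - 1] = 1.0
A = identity(n_patches) + lam * L.tocsr()
d_map = spsolve(A.tocsr(), d_local)                 # MAP depths under the Gaussian MRF
print(np.round(d_map[:10], 2))
```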
5 0.30621198 46 nips-2005-Consensus Propagation
Author: Benjamin V. Roy, Ciamac C. Moallemi
Abstract: We propose consensus propagation, an asynchronous distributed protocol for averaging numbers across a network. We establish convergence, characterize the convergence rate for regular graphs, and demonstrate that the protocol exhibits better scaling properties than pairwise averaging, an alternative that has received much recent attention. Consensus propagation can be viewed as a special case of belief propagation, and our results contribute to the belief propagation literature. In particular, beyond singly-connected graphs, there are very few classes of relevant problems for which belief propagation is known to converge. 1
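For context on the pairwise-averaging baseline mentioned in the abstract, here is a minimal sketch of synchronous neighbour averaging on a ring; the graph, mixing weights, and iteration count are illustrative only, and the update shown is the simple baseline rather than the consensus-propagation protocol itself.

```python
# Minimal sketch of plain distributed averaging on a ring graph -- the kind of
# pairwise-averaging baseline consensus propagation is compared against.
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.uniform(0, 100, size=n)                 # each node starts with a private number
target = x.mean()

vals = x.copy()
for _ in range(2000):
    left, right = np.roll(vals, 1), np.roll(vals, -1)
    vals = 0.5 * vals + 0.25 * (left + right)   # average with the two ring neighbours

print("true average      :", round(target, 4))
print("node estimates in :", round(vals.min(), 4), "-", round(vals.max(), 4))
```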
6 0.30383903 153 nips-2005-Policy-Gradient Methods for Planning
7 0.3031587 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
8 0.29951864 78 nips-2005-From Weighted Classification to Policy Search
9 0.29641318 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
10 0.29585239 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
11 0.29456234 144 nips-2005-Off-policy Learning with Options and Recognizers
12 0.29336178 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception
13 0.29210621 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
14 0.28968763 187 nips-2005-Temporal Abstraction in Temporal-difference Networks
15 0.28738683 177 nips-2005-Size Regularized Cut for Data Clustering
16 0.28696197 75 nips-2005-Fixing two weaknesses of the Spectral Method
17 0.28678972 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning
18 0.28337824 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
19 0.28082097 23 nips-2005-An Application of Markov Random Fields to Range Sensing
20 0.28080398 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions