nips nips2005 nips2005-145 knowledge-graph by maker-knowledge-mining

145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning


Source: pdf

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. [sent-7, score-0.362]

2 We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. [sent-8, score-0.592]

3 We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). [sent-10, score-0.56]

4 This makes them applicable even in settings with a very large number of agents n. [sent-11, score-0.235]

5 1 Introduction Recently there has been great interest in distributed reinforcement learning problems where a collection of agents with independent action choices attempts to optimize a joint performance metric. [sent-12, score-0.505]

6 Problems with such factorizations where the global reward decomposes in to a sum of local rewards are common and have been studied in the RL literature. [sent-14, score-0.777]

7 [10] The most straightforward and common approach to solving these problems is to apply one of the many well-studied single agent algorithms to the global reward signal. [sent-15, score-0.7]

8 Effectively, this treats the multi-agent problem as a single agent problem with a very large action space. [sent-16, score-0.336]

9 [9] establish that policy gradient learning factorizes into independent policy gradient learning problems for each agent using the global reward signal. [sent-18, score-1.283]

10 [3] use global reward signals to estimate effective local rewards for each agent. [sent-20, score-0.78]

11 [5] consider coordinating agent actions using the global reward. [sent-22, score-0.394]

12 In particular, we show in Section 3 that as a function of the number of agents ˜ n, such algorithms will need to see1 Ω(n) trajectories in the worst case to achieve good performance. [sent-24, score-0.279]

13 notably [10]), of developing algorithms that capitalize on the availability of local reward signals to improve performance. [sent-26, score-0.48]

14 Our results show that such local information can dramatically reduce the number of examples necessary for learning to O(log n). [sent-27, score-0.135]

15 One approach that the results suggest to solving such distributed problems is to estimate model parameters from all local information available, and then to solve the resulting model offline. [sent-28, score-0.147]

16 Further, useful approximate multiple agent Markov Decision Process (MDP) solvers that take advantage of local reward structure have been developed. [sent-30, score-0.66]

17 [4] 2 Preliminaries We consider distributed reinforcement learning problems, modeled as MDPs, in which there are n (cooperative) agents, each of which can directly influence only a small number of its neighbors. [sent-31, score-0.129]

18 More formally, let there be n agents, each with a finite state space S of size |S| states and a finite action space A of size |A|. [sent-32, score-0.214]

19 The joint state space of all the agents is therefore S n , and the joint action space An . [sent-33, score-0.457]

20 If st ∈ S n is the joint state of the agents at (i) (i) time t, we will use st to denote the state of agent i. [sent-34, score-0.992]

21 , n} denote the subset of agents that i’s state directly influences. [sent-42, score-0.292]

22 Thus, the agents can be viewed as living on the vertices of a graph, where agents have a direct influence on each other’s state only if they are connected by an edge. [sent-44, score-0.527]

23 (Figure 1 depicts a DBN and an agent influence graph. [sent-46, score-0.239]

24 More formally, each agent i is associated with a CPT (conditional probability table) (i) (neigh(i)) (i) (neigh(i)) Pi (st+1 |st , at ), where st denotes the state of agent i’s neighbors at time t. [sent-48, score-0.749]

25 Given the joint action a of the agents, the joint state evolves according to n (i) (neigh(i)) p(st+1 |st p(st+1 |st , at ) = (i) , at ). [sent-49, score-0.222]

26 (1) i=1 For simplicity, we have assumed that agent i’s state is directly influenced by the states of neigh(i) but not their actions; the generalization offers no difficulties. [sent-50, score-0.296]

27 The initial state s1 is distributed according to some initial-state distribution D. [sent-51, score-0.086]

28 , πn (s)), where πi (s) : S n → A is the local policy of agent i. [sent-56, score-0.61]

29 For some applications, we may wish to consider only policies in which agent i chooses its local action as a function of only its local state s(i) (and possibly its neighbors); in this case, πi can be restricted to depend only on s(i) . [sent-57, score-0.66]

30 Each agent has a local reward function Ri (s(i) , a(i) ), which takes values in the unit intern val [0, 1]. [sent-58, score-0.66]

31 We call this R(s, a) the global reward function, since it reflects the total reward received by the joint set of agents. [sent-60, score-0.81]

32 Thus, the utility of a policy π in an MDP M is U (π) = UM (π) = Es1 ∼D [V π (s1 )] = E 1 n T n (i) (i) Ri (st , at )|π . [sent-62, score-0.323]

33 t=1 i=1 In the reinforcement learning setting, the dynamics (CPTs) and rewards of the problem are unknown, and a learning algorithm has to take actions in the MDP and use the resulting observations of state transitions and rewards to learn a good policy. [sent-63, score-0.744]

34 Each “trial” taken by a reinforcement learning algorithm shall consist of a T -step sequence in the MDP. [sent-64, score-0.118]

35 A connection between nodes in the graph implies arrows connecting the nodes in the DBN. [sent-68, score-0.08]

36 Our goal is to characterize the scaling of the sample complexity for various reinforcement learning approaches (i. [sent-69, score-0.127]

37 , how many trials they require in order to learn a near-optimal policy) for large numbers of agents n. [sent-71, score-0.328]

38 In contrast, in Section 4, we show that if a reinforcement learning algorithm is given access to the local rewards, it can be possible to learn in such problems with an exponentially smaller O(log n) sample complexity. [sent-77, score-0.242]

39 Let any reinforcement learning algorithm L be given that only uses the global reward signal R(s), and does not use the local rewards Ri (s(i) ) to learn (other than through their sum). [sent-81, score-0.931]

40 The MDP is very “simple” in that it has only one state (|S| = 1, |S n | = 1); trivial state transition probabilities (since T = 1); two actions per agent (|A| = 2); and deterministic binary (0/1)-valued local reward functions. [sent-83, score-0.887]

41 In order for L to output a policy π that is near-optimal satisfying2 U (ˆ ) ≥ ˆ π maxπ U (π) − ǫ,it is necessary that the number of trials m be at least 0. [sent-85, score-0.396]

42 For simplicity, we first assume that L is a deterministic learning algorithm, so that in each of the m trials, its choice of action is some deterministic function of the outcomes of the earlier trials. [sent-88, score-0.215]

43 Thus, in each of the m trials, L chooses a vector of actions a ∈ AN , n 1 and receives the global reward signal R(s, a) = n i=1 R(s(i) , a(i) ). [sent-89, score-0.579]

44 In our MDP, each (i) (i) local reward R(s , a ) will take values only 0 and 1. [sent-90, score-0.441]

45 Since T = 1, the algorithm receives only one such n reward value in each trial. [sent-95, score-0.36]

46 , rm be the m global reward signals received by L in the m trials. [sent-99, score-0.483]

47 Since L is deterministic, its output policy π will be chosen as some deterministic function of these ˆ 2 For randomized algorithms we consider instead the expectation of U (ˆ ) under the algorithm’s π randomization. [sent-100, score-0.37]

48 Specifically, each local reward Ri (s(i) , a(i) ) function is randomly chosen with equal probability to either give reward 1 for action a1 and reward 0 for action a2 ; or vice versa. [sent-113, score-1.379]

49 Thus, each local agent has one “right” action that gives reward 1, but the algorithm has to learn which of the two actions this is. [sent-114, score-0.893]

50 Further, by choosing the right actions, the optimal policy π ∗ attains U (π ∗ ) = 1. [sent-115, score-0.292]

51 5) random variables (since the rewards are chosen randomly), and has expected value 0. [sent-118, score-0.216]

52 Thus, taking a union bound over all policies π ∈ ΠM , we have P (∃π ∈ ΠM s. [sent-122, score-0.122]

53 But since L outputs a policy in ΠM , the chance of L outputting a policy π with UM (ˆ ) ≥ 1 − 2ǫ is bounded by the chance that ˆ π there exists such a policy in ΠM . [sent-127, score-0.98]

54 For randomized algorithms L, we can define for each string of input random numbers to the algorithm ω a deterministic algorithm Lω . [sent-137, score-0.114]

55 In this section, we will assume that the neighborhood structure (encoded by neigh(i)) is known, but that the CPT parameters of the dynamics and the reward functions are unknown. [sent-140, score-0.393]

56 We also assume that the size of the largest neighborhood is bounded by maxi |neigh(i)| = B. [sent-141, score-0.081]

57 1: Suppose the MDP’s initial state distribution is random, so that the state (i) si of each agent i is chosen independently from some distribution Di . [sent-145, score-0.358]

58 Further, assume that Di assigns probability at least ρ > 0 to each possible state value s ∈ S. [sent-146, score-0.095]

59 Then the “random” policy π (that on each time-step chooses each agent’s action uniformly at 1 random over A) is a (ρ, |A| )-exploration policy. [sent-147, score-0.442]

60 For any agent i, the initial state of s(neigh(i)) has has at least a ρB chance of being any particular vector of values, and the random action policy has a 1/|A| chance of taking any particular action from this state. [sent-149, score-0.916]

61 Specifically, it guarantees that we visit each local configuration sufficiently often to have a reasonable amount of data to estimate each CPT. [sent-152, score-0.099]

62 3 In the envisioned procedure, we will execute an exploration policy for m trials, and then use the resulting data we collect to obtain the maximum-likelihood estimates for the (i) (neigh(i)) (i) CPT entries and the rewards. [sent-153, score-0.39]

63 4 The following simple lemma shows that, with a number of trials that grows only logarithmically in n, this procedure will give us good estimates for all CPTs and local rewards. [sent-155, score-0.236]

64 Suppose |neigh(i)| ≤ B for all i, and let a (ρ, ν)-exploration policy be executed for m trials. [sent-158, score-0.332]

65 Taking a union bound over them, setting our probability of incorrectly estimating any CPTs or rewards to δ/2, and solving for c gives B+1 c ≥ ǫ2 log( 4 n |A||S| ). [sent-164, score-0.319]

66 For each agent i we see each local configurations of states and 2 δ 0 actions (s(neigh(i)) , a(i) ) with probability ≥ ρB ν. [sent-165, score-0.431]

67 ˆ served in the training data, and similarly let R(s (s(neigh(i)) ,a(i) ) of samples we see for each CPT entry is at least mρB ν. [sent-169, score-0.091]

68 mT 2 Applying again the union bound to ensure that the probability of failure here is ≤ δ/2 and solving for m gives the result. [sent-173, score-0.103]

69 Define the radius of influence r(t) after t steps to be the maximum number of nodes that are within t steps in the neighborhood graph of any single node. [sent-175, score-0.103]

70 If the neighborhood graph is a 2-d lattice in which each node has at most 4 neighbors, then r(t) = O(t2 ). [sent-178, score-0.075]

71 Note that, even in the worst case, by our assumption of each node having B neighbors, we still have the bound r(t) ≤ B t , which is a bound independent of the number of agents n. [sent-180, score-0.315]

72 Suppose |neigh(i)| ≤ B for all i, and let a ˆ (ρ, ν)-exploration policy be executed for m trials in the MDP M . [sent-183, score-0.4]

73 Let Π be a policy class, and let π = arg max UM (π) ˆ ˆ π∈Π ˆ be the best policy in the class, as evaluated on M . [sent-185, score-0.604]

74 Our approach is essentially constructive: we show that for any policy, finite-horizon value-iteration using approximate CPTs and rewards in its backups will correctly estimate the true value function for that policy within ǫ/2. [sent-190, score-0.544]

75 2) with m samples we can know both CPTs and rewards with the probability required within any required ǫ0 . [sent-193, score-0.265]

76 Note also that for any MDP with the given DBN or neighborhood graph structure (including ˆ both M and M ) the value function for every policy π and at each time-step has a property of bounded variation: r(T )T (i) ˆ ˆ |Vt (s(1) , . [sent-194, score-0.397]

77 , s(n) | ≤ n This follows since a change in state can effect at most r(T ) agents’ states, so the resulting change in utility must be bounded by r(T )T /n. [sent-203, score-0.118]

78 To compute a bound on the error in our estimate of overall utility we compute a bound ˆ ˆˆ on the error induced by a one-step Bellman backup ||B V − B V ||∞ . [sent-204, score-0.147]

79 This quantity can be bounded in turn by considering the sequence of partially correct backup operators ˆ ˆ ˆ B0 , . [sent-205, score-0.086]

80 , Bn where Bi is defined as the Bellman operator for policy π using the exact transitions and rewards for agents 1, 2, . [sent-208, score-0.764]

81 , i, and the estimated transitions rewards/transitions 2500 200 agents, 20% noise is observed rewards 1 local learner global learner local learner global learner 0. [sent-211, score-0.823]

82 1 0 0 500 1000 1500 number of training examples 2000 2500 0 0 50 100 150 200 250 number of agents 300 350 400 Figure 2: (Left) Scaling of performance as a function of the number of trajectories seen for a global reward and local reward algorithms. [sent-220, score-1.146]

83 (Right) Scaling of the number of samples necessary to achieve near optimal reward as a function of the number of agents. [sent-221, score-0.389]

84 5 Demonstration We first present an experimental domain that hews closely to the theory in Section (3) above to demonstrate the importance of local rewards. [sent-236, score-0.099]

85 In our simple problem there are n = 400 independent agents who each choose an action in {0, 1}. [sent-237, score-0.352]

86 Each agent has a “correct” action that earns it reward Ri = 1 with probability 0. [sent-238, score-0.729]

87 Equally, if the agents chooses the wrong action, it earns reward Ri = 1 with probability 0. [sent-241, score-0.661]

88 Our first global algorithm uses only the global rewards R and uses this to build a model of the local rewards, and finally solves the resulting estimated MDP exactly. [sent-244, score-0.537]

89 The local reward functions are learnt by a least-squares procedure with basis functions for each agent. [sent-245, score-0.441]

90 The second algorithm also learns a local reward function, but does so taking advantage of the local rewards it observes as opposed to only the global signal. [sent-246, score-0.898]

91 Figure (2) demonstrates the advantages of learning using a global 1 reward signal. [sent-247, score-0.462]

92 5 On the right in Figure (2), we compute the time required to achieve 4 of optimal reward for each algorithm, as a function of the number of agents. [sent-248, score-0.342]

93 In our next example, we consider a simple variant of the multi-agent S YS A DMIN6 prob5 A gradient-based model-free approach using the global reward signal was also tried, but its performance was significantly poorer than that of the two algorithms depicted in Figure (2, left). [sent-249, score-0.491]

94 Again, we consider two algorithms: a global R EINFORCE [9] learner, and a R E INFORCE algorithm run using only local rewards, even through the local R EINFORCE algorithm run in this way is not guaranteed to converge to the globally optimal (cooperative) solution. [sent-254, score-0.336]

95 We note that the local algorithm learns much more quickly than using the global reward. [sent-255, score-0.219]

96 (Figure 3) The learning speed we observed for the global algorithm correlates well with the observations in [5] that the number of samples needed scales roughly linearly in the number of agents. [sent-256, score-0.167]

97 The local algorithm continued to require essentially the same number of examples for all sizes used (up to over 100 agents) in our experiments. [sent-257, score-0.117]

98 Local refers to R EINFORCE applied using only neighborhood (local) rewards while global refers to standard R EINFORCE (applied to the global reward signal). [sent-281, score-0.813]

99 (Left) shows averaged reward performance as a function of number of iterations for 10 agents. [sent-282, score-0.342]

100 All learning is local: Multi-agent learning in global reward games. [sent-299, score-0.48]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('neigh', 0.571), ('reward', 0.342), ('policy', 0.292), ('agents', 0.235), ('agent', 0.219), ('rewards', 0.216), ('st', 0.2), ('mdp', 0.182), ('um', 0.158), ('cpt', 0.124), ('action', 0.117), ('dbn', 0.113), ('cpts', 0.107), ('global', 0.102), ('local', 0.099), ('einforce', 0.089), ('reinforcement', 0.082), ('sm', 0.079), ('actions', 0.073), ('bi', 0.069), ('trials', 0.068), ('ep', 0.059), ('exploration', 0.057), ('state', 0.057), ('traf', 0.057), ('neighborhood', 0.051), ('bellman', 0.05), ('ys', 0.047), ('learner', 0.046), ('guestrin', 0.042), ('ri', 0.042), ('bound', 0.04), ('deterministic', 0.04), ('hoeffding', 0.04), ('cooperative', 0.04), ('chance', 0.037), ('uence', 0.036), ('policies', 0.036), ('backup', 0.036), ('backups', 0.036), ('kearns', 0.036), ('neighbors', 0.034), ('chooses', 0.033), ('utility', 0.031), ('formalisms', 0.031), ('earns', 0.031), ('chang', 0.031), ('dmin', 0.031), ('peshkin', 0.031), ('vt', 0.03), ('bn', 0.03), ('bounded', 0.03), ('samples', 0.029), ('signal', 0.029), ('distributed', 0.029), ('planning', 0.029), ('log', 0.028), ('nodes', 0.028), ('scaling', 0.027), ('poly', 0.026), ('trajectories', 0.026), ('si', 0.025), ('lemma', 0.025), ('learn', 0.025), ('logarithmically', 0.025), ('factored', 0.025), ('entry', 0.024), ('union', 0.024), ('graph', 0.024), ('joint', 0.024), ('bounds', 0.023), ('theoretic', 0.023), ('horizon', 0.023), ('mdps', 0.023), ('taking', 0.022), ('entries', 0.022), ('transitions', 0.021), ('visited', 0.021), ('signals', 0.021), ('guration', 0.021), ('let', 0.02), ('operators', 0.02), ('executed', 0.02), ('depicts', 0.02), ('randomized', 0.02), ('probability', 0.02), ('states', 0.02), ('solving', 0.019), ('rl', 0.019), ('estimates', 0.019), ('uai', 0.018), ('learning', 0.018), ('exp', 0.018), ('rm', 0.018), ('necessary', 0.018), ('least', 0.018), ('sum', 0.018), ('logarithmic', 0.018), ('algorithms', 0.018), ('algorithm', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

2 0.33907196 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1

3 0.28619727 153 nips-2005-Policy-Gradient Methods for Planning

Author: Douglas Aberdeen

Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1

4 0.22998586 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

Author: Deepak Verma, Rajesh P. Rao

Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.

5 0.20540768 144 nips-2005-Off-policy Learning with Options and Recognizers

Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh

Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. (n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ : S → [0, 1]. The structure used to compute µ will have to be compatible ˆ ˆ with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ will have one ˆ entry for every partition p of the state space. Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ converges to ˆ ∑s d b (s|p) ∑a c(p, a)b(s, a) where d b (s|p) is the probability of visiting state s from partiˆ ˆ tion p under the stationary distribution of b. At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i.e., ∑a π(s, a) = 1). Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. Note though that the learning algorithm never uses π; the only quantities needed ˆ are ρ, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work. Acknowledgements The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML. Sutton, R.S., Precup D. and Singh, S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol . 112, pp. 181–211. Sutton,, R.S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17. Sutton R.S., Raffols E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks”. In Proceedings of NIPS-18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. In Machine learning vol. 42, pp. 241-267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:674–690.

6 0.19710892 36 nips-2005-Bayesian models of human action understanding

7 0.19564575 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

8 0.18302837 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

9 0.16709201 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

10 0.15079579 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

11 0.13846801 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

12 0.11742975 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

13 0.11063558 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity

14 0.10750131 111 nips-2005-Learning Influence among Interacting Markov Chains

15 0.090028293 53 nips-2005-Cyclic Equilibria in Markov Games

16 0.085591681 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

17 0.085191473 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

18 0.05884837 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

19 0.055224184 141 nips-2005-Norepinephrine and Neural Interrupts

20 0.052802116 148 nips-2005-Online Discovery and Learning of Predictive State Representations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.189), (1, -0.018), (2, 0.548), (3, -0.163), (4, -0.063), (5, 0.024), (6, 0.035), (7, -0.036), (8, 0.01), (9, -0.076), (10, -0.004), (11, -0.067), (12, 0.053), (13, 0.029), (14, -0.008), (15, -0.02), (16, 0.054), (17, -0.046), (18, 0.019), (19, -0.04), (20, 0.055), (21, 0.061), (22, -0.007), (23, 0.054), (24, -0.016), (25, -0.022), (26, -0.083), (27, -0.049), (28, -0.056), (29, -0.011), (30, -0.001), (31, 0.022), (32, -0.049), (33, -0.018), (34, 0.028), (35, -0.009), (36, -0.053), (37, -0.0), (38, 0.004), (39, 0.017), (40, 0.028), (41, 0.009), (42, -0.034), (43, 0.025), (44, -0.013), (45, 0.03), (46, 0.034), (47, 0.016), (48, -0.027), (49, -0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97028011 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

2 0.84148341 153 nips-2005-Policy-Gradient Methods for Planning

Author: Douglas Aberdeen

Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1

3 0.82021242 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

Author: Deepak Verma, Rajesh P. Rao

Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.

4 0.7920202 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1

5 0.691302 144 nips-2005-Off-policy Learning with Options and Recognizers

Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh

Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. (n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ : S → [0, 1]. The structure used to compute µ will have to be compatible ˆ ˆ with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ will have one ˆ entry for every partition p of the state space. Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ converges to ˆ ∑s d b (s|p) ∑a c(p, a)b(s, a) where d b (s|p) is the probability of visiting state s from partiˆ ˆ tion p under the stationary distribution of b. At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i.e., ∑a π(s, a) = 1). Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. Note though that the learning algorithm never uses π; the only quantities needed ˆ are ρ, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work. Acknowledgements The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML. Sutton, R.S., Precup D. and Singh, S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol . 112, pp. 181–211. Sutton,, R.S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17. Sutton R.S., Raffols E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks”. In Proceedings of NIPS-18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. In Machine learning vol. 42, pp. 241-267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:674–690.

6 0.65873516 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

7 0.65392536 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

8 0.64533788 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

9 0.5939458 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

10 0.56904596 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

11 0.54016101 36 nips-2005-Bayesian models of human action understanding

12 0.50201726 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

13 0.48723611 53 nips-2005-Cyclic Equilibria in Markov Games

14 0.39420068 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity

15 0.37346983 111 nips-2005-Learning Influence among Interacting Markov Chains

16 0.33761674 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

17 0.32193145 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

18 0.31429732 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

19 0.28860784 141 nips-2005-Norepinephrine and Neural Interrupts

20 0.27000731 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.062), (10, 0.047), (27, 0.045), (31, 0.485), (34, 0.065), (55, 0.023), (69, 0.042), (88, 0.062), (91, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97555614 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks

Author: Viren Jain, Valentin Zhigulin, H. S. Seung

Abstract: There is little consensus about the computational function of top-down synaptic connections in the visual system. Here we explore the hypothesis that top-down connections, like bottom-up connections, reflect partwhole relationships. We analyze a recurrent network with bidirectional synaptic interactions between a layer of neurons representing parts and a layer of neurons representing wholes. Within each layer, there is lateral inhibition. When the network detects a whole, it can rigorously enforce part-whole relationships by ignoring parts that do not belong. The network can complete the whole by filling in missing parts. The network can refuse to recognize a whole, if the activated parts do not conform to a stored part-whole relationship. Parameter regimes in which these behaviors happen are identified using the theory of permitted and forbidden sets [3, 4]. The network behaviors are illustrated by recreating Rumelhart and McClelland’s “interactive activation” model [7]. In neural network models of visual object recognition [2, 6, 8], patterns of synaptic connectivity often reflect part-whole relationships between the features that are represented by neurons. For example, the connections of Figure 1 reflect the fact that feature B both contains simpler features A1, A2, and A3, and is contained in more complex features C1, C2, and C3. Such connectivity allows neurons to follow the rule that existence of the part is evidence for existence of the whole. By combining synaptic input from multiple sources of evidence for a feature, a neuron can “decide” whether that feature is present. 1 The synapses shown in Figure 1 are purely bottom-up, directed from simple to complex features. However, there are also top-down connections in the visual system, and there is little consensus about their function. One possibility is that top-down connections also reflect part-whole relationships. They allow feature detectors to make decisions using the rule that existence of the whole is evidence for existence of its parts. In this paper, we analyze the dynamics of a recurrent network in which part-whole relationships are stored as bidirectional synaptic interactions, rather than the unidirectional interactions of Figure 1. The network has a number of interesting computational capabilities. When the network detects a whole, it can rigorously enforce part-whole relationships 1 Synaptic connectivity may reflect other relationships besides part-whole. For example, invariances can be implemented by connecting detectors of several instances of the same feature to the same target, which is consequently an invariant detector of the feature. C1 C2 C3 B A1 A2 A3 Figure 1: The synaptic connections (arrows) of neuron B represent part-whole relationships. Feature B both contains simpler features and is contained in more complex features. The synaptic interactions are drawn one-way, as in most models of visual object recognition. Existence of the part is regarded as evidence for existence of the whole. This paper makes the interactions bidirectional, allowing the existence of the whole to be evidence for the existence of its parts. by ignoring parts that do not belong. The network can complete the whole by filling in missing parts. The network can refuse to recognize a whole, if the activated parts do not conform to a stored part-whole relationship. Parameter regimes in which these behaviors happen are identified using the recently developed theory of permitted and forbidden sets [3, 4]. Our model is closely related to the interactive activation model of word recognition, which was proposed by McClelland and Rumelhart to explain the word superiority effect studied by visual psychologists [7]. Here our concern is not to model a psychological effect, but to characterize mathematically how computations involving part-whole relationships can be carried out by a recurrent network. 1 Network model Suppose that we are given a set of part-whole relationships specified by a ξi = 1, if part i is contained in whole a 0, otherwise We assume that every whole contains at least one part, and every part is contained in at least one whole. The stimulus drives a layer of neurons that detect parts. These neurons also interact with a layer of neurons that detect wholes. We will refer to part-detectors as “P-neurons” and whole-detectors as “W-neurons.” The part-whole relationships are directly stored in the synaptic connections between P and a W neurons. If ξi = 1, the ith neuron in the P layer and the ath neuron in the W layer have a an excitatory interaction of strength γ. If ξi = 0, the neurons have an inhibitory interaction of strength σ. Furthermore, the P-neurons inhibit each other with strength β, and the Wneurons inhibit each other with strength α. All of these interactions are symmetric, and all activation functions are the rectification nonlinearity [z]+ = max{z, 0}. Then the dynamics of the network takes the form  ˙ Wa + Wa a Pi ξ i − σ = γ i + a (1 − ξi )Pi − α i Wb  , +  ˙ Pi + Pi (1) b=a a Wa ξ i − σ = γ a a (1 − ξi )Wa − β a Pj + B i  . j=i (2) where Bi is the input to the P layer from the stimulus. Figure 2 shows an example of a network with two wholes. Each whole contains two parts. One of the parts is contained in both wholes. -α Wa excitation γ -σ inhibition P1 B1 -β } W layer Wb -σ P2 -β B2 P3 } P layer B3 Figure 2: Model in example configuration: ξ = {(1, 1, 0), (0, 1, 1)}. When a stimulus is presented, it activates some of the P-neurons, which activate some of the W-neurons. The network eventually converges to a stable steady state. We will assume that α > 1. In the Appendix, we prove that this leads to unconditional winner-take-all behavior in the W layer. In other words, no more than one W-neuron can be active at a stable steady state. If a single W-neuron is active, then a whole has been detected. Potentially there are also many P-neurons active, indicating detection of parts. This representation may have different properties, depending on the choice of parameters β, γ, and σ. As discussed below, these include rigorous enforcement of part-whole relationships, completion of wholes by “filling in” missing parts, and non-recognition of parts that do not conform to a whole. 2 Enforcement of part-whole relationships Suppose that a single W-neuron is active at a stable steady state, so that a whole has been detected. Part-whole relationships are said to be enforced if the network always ignores parts that are not contained in the detected whole, despite potentially strong bottom-up evidence for them. It can be shown that enforcement follows from the inequality σ 2 + β 2 + γ 2 + 2σβγ > 1. (3) which guarantees that neuron i in the P layer is inactive, if neuron a in the W layer is a active and ξi = 0. When part-whole relations are enforced, prior knowledge about legal combinations of parts strictly constrains what may be perceived. This result is proven in the Appendix, and only an intuitive explanation is given here. Enforcement is easiest to understand when there is interlayer inhibition (σ > 0). In this case, the active W-neuron directly inhibits the forbidden P-neurons. The case of σ = 0 is more subtle. Then enforcement is mediated by lateral inhibition in the P layer. Excitatory feedback from the W-neuron has the effect of counteracting the lateral inhibition between the P-neurons that belong to the whole. As a result, these P-neurons become strongly activated enough to inhibit the rest of the P layer. 3 Completion of wholes by filling in missing parts If a W-neuron is active, it excites the P-neurons that belong to the whole. As a result, even if one of these P-neurons receives no bottom-up input (Bi = 0), it is still active. We call this phenomenon “completion,” and it is guaranteed to happen when (4) γ> β The network may thus “imagine” parts that are consistent with the recognized whole, but are not actually present in the stimulus. As with enforcement, this condition depends on top-down connections. √ In the special case γ = β, the interlayer excitation between a W-neuron and its P-neurons exactly cancels out the lateral inhibition between the P-neurons at a steady state. So the recurrent connections effectively vanish, letting the activity of the P-neurons be determined by their feedforward inputs. When the interlayer excitation is stronger than this, the inequality (4) holds, and completion occurs. 4 Non-recognition of a whole If there is no interlayer inhibition (σ = 0), then a single W-neuron is always active, assuming that there is some activity in the P layer. To see this, suppose for the sake of contradiction that all the W-neurons are inactive. Then they receive no inhibition to counteract the excitation from the P layer. This means some of them must be active, which contradicts our assumption. This means that the network always recognizes a whole, even if the stimulus is very different from any part-whole combination that is stored in the network. However, if interlayer inhibition is sufficiently strong (large σ), the network may refuse to recognize a whole. Neurons in the P layer are activated, but there is no activity in the W layer. Formal conditions on σ can be derived, but are not given here because of space limitations. In case of non-recognition, constraints on the P-layer are not enforced. It is possible for the network to detect a configuration of parts that is not consistent with any stored whole. 5 Example: Interactive Activation model To illustrate the computational capabilities of our network, we use it to recreate the interactive activation (IA) model of McClelland and Rumelhart. Figure 3 shows numerical simulations of a network containing three layers of neurons representing strokes, letters, and words, respectively. There are 16 possible strokes in each of four letter positions. For each stroke, there are two neurons, one signaling the presence of the stroke and the other signaling its absence. Letter neurons represent each letter of the alphabet in each of four positions. Word neurons represent each of 1200 common four letter words. The letter and word layers correspond to the P and W layers that were introduced previously. There are bidirectional interactions between the letter and word layers, and lateral inhibition within the layers. The letter neurons also receive input from the stroke neurons, but this interaction is unidirectional. Our network differs in two ways from the original IA model. First, all interactions involving letter and word neurons are symmetric. In the original model, the interactions between the letter and word layers were asymmetric. In particular, inhibitory connections only ran from letter neurons to word neurons, and not vice versa. Second, the only nonlinearity in our model is rectification. These two aspects allow us to apply the full machinery of the theory of permitted and forbidden sets. Figure 3 shows the result of presenting the stimulus “MO M” for four different settings of parameters. In each of the four cases, the word layer of the network converges to the same result, detecting the word “MOON”, which is the closest stored word to the stimulus. However, the activity in the letter layer is different in the four cases. input: P layer reconstruction W layer P layer reconstruction W layer completion noncompletion enforcement non-enforcement Figure 3: Simulation of 4 different parameter regimes in a letter- word recognition network. Within each panel, the middle column presents a feature- layer reconstruction based on the letter activity shown in the left column. W layer activity is shown in the right column. The top row shows the network state after 10 iterations of the dynamics. The bottom row shows the steady state. In the left column, the parameters obey the inequality (3), so that part- whole relationships are enforced. The activity of the letter layer is visualized by activating the strokes corresponding to each active letter neuron. The activated letters are part of the word “MOON”. In the top left, the inequality (4) is satisfied, so that the missing “O” in the stimulus is filled in. In the bottom left, completion does not occur. In the simulations of the right column, parameters are such that part- whole relationships are not enforced. Consequently, the word layer is much more active. Bottom- up input provides evidence for several other letters, which is not suppressed. In the top right, the inequality (4) is satisfied, so that the missing “O” in the stimulus is filled in. In the bottom right, the “O” neuron is not activated in the third position, so there is no completion. However, some letter neurons for the third position are activated, due to the input from neurons that indicate the absence of strokes. input: non-recognition event multi-stability Figure 4: Simulation of a non- recognition event and example of multistability. Figure 4 shows simulations for large σ, deep in the enforcement regime where non- recognition is a possibility. From one initial condition, the network converges to a state in which no W neurons are active, a non- recognition. From another initial condition, the network detects the word “NORM”. Deep in the enforcement regime, the top- down feedback can be so strong that the network has multiple stable states, many of which bear little resemblance to the stimulus at all. This is a problematic aspect of this network. It can be prevented by setting parameters at the edge of the enforcement regime. 6 Discussion We have analyzed a recurrent network that performs computations involving part- whole relationships. The network can fill in missing parts and suppress parts that do not belong. These two computations are distinct and can be dissociated from each other, as shown in Figure 3. While these two computations can also be performed by associative memory models, they are not typically dissociable in these models. For example, in the Hopfield model pattern completion and noise suppression are both the result of recall of one of a finite number of stereotyped activity patterns. We believe that our model is more appropriate for perceptual systems, because its behavior is piecewise linear, due its reliance on rectification nonlinearity. Therefore, analog aspects of computation are able to coexist with the part-whole relationships. Furthermore, in our model the stimulus is encoded in maintained synaptic input to the network, rather than as an initial condition of the dynamics. A Appendix: Permitted and forbidden sets Our mathematical results depend on the theory of permitted and forbidden sets [3, 4], which is summarized briefly here. The theory is applicable to neural networks with rectification nonlinearity, of the form xi + xi = [bi + j Wij xj ]+ . Neuron i is said to be active when ˙ xi > 0. For a network of N neurons, there are 2N possible sets of active neurons. For each active set, consider the submatrix of Wij corresponding to the synapses between active neurons. If all eigenvalues of this submatrix have real parts less than or equal to unity, then the active set is said to be permitted. Otherwise the active set is said to be forbidden. A set is permitted if and only if there exists an input vector b such that those neurons are active at a stable steady state. Permitted sets can be regarded as memories stored in the synaptic connections Wij . If Wij is a symmetric matrix, the nesting property holds: every subset of a permitted set is permitted, and every superset of a forbidden set is forbidden. The present model can be seen as a general method for storing permitted sets in a recurrent network. This method introduces a neuron for each permitted set, relying on a unary or “grandmother cell” representation. In contrast, Xie et al.[9] used lateral inhibition in a single layer of neurons to store permitted sets. By introducing extra neurons, the present model achieves superior storage capacity, much as unary models of associative memory [1] surpass distributed models [5]. A.1 Unconditional winner-take-all in the W layer The synapses between two W-neurons have strengths 0 −α −α 0 The eigenvalues of this matrix are ±α. Therefore two W-neurons constitute a forbidden set if α > 1. By the nesting property, it follows more than two W-neurons is also a forbidden set, and that the W layer has the unconditional winner-take-all property. A.2 Part-whole combinations as permitted sets Theorem 1. Suppose that β < 1. If γ 2 < β + (1 − β)/k then any combination of k ≥ 1 parts consistent with a whole corresponds to a permitted set. Proof. Consider k parts belonging to a whole. They are represented by one W-neuron and k P-neurons, with synaptic connections given by the (k + 1) × (k + 1) matrix M= −β(11T − I) γ1 , γ1T 0 (5) where 1 is the k- dimensional vector whose elements are all equal to one. Two eigenvectors of M are of the form (1T c), and have the same eigenvalues as the 2 × 2 matrix −β(k − 1) γk γ 0 This matrix has eigenvalues less than one when γ 2 < β + (1 − β)/k and β(k − 1) + 2 > 0. The other k − 1 eigenvectors are of the form (dT , 0), where dT 1 = 0. These have eigenvalues β. Therefore all eigenvalues of W are less than one if the condition of the theorem is satisfied. A.3 Constraints on combining parts Here, we derive conditions under which the network can enforce the constraint that steady state activity be confined to parts that constitute a whole. Theorem 2. Suppose that β > 0 and σ 2 +β 2 +γ 2 +2σβγ > 1 If a W- neuron is active, then only P- neurons corresponding to parts contained in the relevant whole can be active at a stable steady state. Proof. Consider P- neurons Pi , Pj , and W- neuron Wa . Supa a pose that ξi = 1 but ξj = 0. As shown in Figure 5, the matrix of connections is given by: W = 0 −β γ −β 0 −σ γ −σ 0 (6) Wa γ Pi -σ -β Pj Figure 5: A set of one W- neuron and two P- neurons is forbidden if one part belongs to the whole and the other does not. This set is permitted if all eigenvalues of W − I have negative real parts. The characteristic equation of I − W is λ3 + b1 λ2 + b2 λ + b3 = 0, where b1 = 3, b2 = 3 − σ 2 − β 2 − γ 2 and b3 = 1−2σβγ−σ 2 −β 2 −γ 2 . According to the Routh- Hurwitz theorem, all the eigenvalues have negative real parts if and only if b1 > 0, b3 > 0 and b1 b2 > b3 . Clearly, the first condition is always satisfied. The second condition is more restrictive than the third. It is satisfied only when σ 2 + β 2 + γ 2 + 2σβγ < 1. Hence, one of the eigenvalues has a positive real part when this condition is broken, i.e., when σ 2 +β 2 +γ 2 +2σβγ > 1. By the nesting property, any larger set of P- neurons inconsistent with the W- neuron is also forbidden. A.4 Completion of wholes √ Theorem 3. If γ > β and a single W- neuron a is active at a steady state, then Pi > 0 a for all i such that ξi = 1. Proof. Suppose that the detected whole has k parts. At the steady state Pi = a ξi Bi − (β − γ 2 )Ptot 1−β + where Ptot = Pi = i 1 1 − β + (β − γ 2 )k k a B i ξi i=1 (7) A.5 Preventing runaway If feedback loops cause the network activity to diverge, then the preceding analyses are not relevant. Here we give a sufficient condition guaranteeing that runaway instability does not happen. It is not a necessary condition. Interestingly, the condition implies the condition of Theorem 1. Theorem 4. Suppose that P and W obey the dynamics of Eqs. (1) and (2), and define the objective function E 1−α 2 = − 2 Wa a α + 2 Wa a 1−β + 2 a Pi Wa ξi + σ Bi Pi − γ i 2 ia Pi2 i 2 β + 2 Pi i a (1 − ξi )Pi Wa . (8) ia Then E is a Lyapunov like function that, given β > γ 2 − dynamics to a stable steady state. 1−γ 2 N −1 , ensures convergence of the Proof. (sketch) Differentiation of E with respect to time shows that that E is nonincreasing in the nonnegative orthant and constant only at steady states of the network dynamics. We must also show that E is radially unbounded, which is true if the quadratic part of E is copositive definite. Note that the last term of E is lower-bounded by zero and the previous term is upper bounded by γ ia Pi Wa . We assume α > 1. Thus, we can use Cauchy’s 2 2 inequality, i Pi2 ≥ ( i Pi ) /N , and the fact that a Wa ≤ ( a Wa )2 for Wa ≥ 0, to derive E≥ 1 2 Wa )2 + ( If β > γ 2 − unbounded. a 1−γ 2 N −1 , 1 − β + βN ( N Pi )2 − 2γ( i Wa a Pi ) i − Bi Pi . (9) i the quadratic form in the inequality is positive definite and E is radially References [1] E. B. Baum, J. Moody, and F. Wilczek. Internal representations for associative memory. Biol. Cybern., 59:217–228, 1988. [2] K. Fukushima. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern, 36(4):193–202, 1980. [3] R.H. Hahnloser, R. Sarpeshkar, M.A. Mahowald, R.J. Douglas, and H.S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947– 51, Jun 22 2000. [4] R.H. Hahnloser, H.S. Seung, and J.-J. Slotine. Permitted and forbidden sets in symmetric threshold-linear networks. Neural Computation, 15:621–638, 2003. [5] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci U S A, 79(8):2554–8, Apr 1982. [6] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1:541–551, 1989. [7] J. L. McClelland and D. E. Rumelhart. An interactive activation model of context effects in letter perception: Part i. an account of basic findings. Psychological Review, 88(5):375–407, Sep 1981. [8] M Riesenhuber and T Poggio. Hierarchical models of object recognition in cortex. Nat Neurosci, 2(11):1019–25, Nov 1999. [9] X. Xie, R.H. Hahnloser, and H. S. Seung. Selectively grouping neurons in recurrent networks of lateral inhibition. Neural Computation, 14:2627–2646, 2002.

2 0.95949638 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation

Author: Dmitry Malioutov, Alan S. Willsky, Jason K. Johnson

Abstract: This paper presents a new framework based on walks in a graph for analysis and inference in Gaussian graphical models. The key idea is to decompose correlations between variables as a sum over all walks between those variables in the graph. The weight of each walk is given by a product of edgewise partial correlations. We provide a walk-sum interpretation of Gaussian belief propagation in trees and of the approximate method of loopy belief propagation in graphs with cycles. This perspective leads to a better understanding of Gaussian belief propagation and of its convergence in loopy graphs. 1

same-paper 3 0.92925805 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

4 0.92478645 108 nips-2005-Layered Dynamic Textures

Author: Antoni B. Chan, Nuno Vasconcelos

Abstract: A dynamic texture is a video model that treats a video as a sample from a spatio-temporal stochastic process, specifically a linear dynamical system. One problem associated with the dynamic texture is that it cannot model video where there are multiple regions of distinct motion. In this work, we introduce the layered dynamic texture model, which addresses this problem. We also introduce a variant of the model, and present the EM algorithm for learning each of the models. Finally, we demonstrate the efficacy of the proposed model for the tasks of segmentation and synthesis of video.

5 0.92303938 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1

6 0.68579215 78 nips-2005-From Weighted Classification to Policy Search

7 0.6467663 153 nips-2005-Policy-Gradient Methods for Planning

8 0.62093484 46 nips-2005-Consensus Propagation

9 0.61035329 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

10 0.58495468 111 nips-2005-Learning Influence among Interacting Markov Chains

11 0.58339256 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

12 0.579458 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

13 0.55791098 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

14 0.5561586 36 nips-2005-Bayesian models of human action understanding

15 0.55609417 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks

16 0.53948259 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

17 0.53807592 139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes

18 0.5271666 181 nips-2005-Spiking Inputs to a Winner-take-all Network

19 0.52597344 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs

20 0.52455485 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions