NIPS 2001, Paper 148: Predictive Representations of State
Source: pdf
Author: Michael L. Littman, Richard S. Sutton
Abstract: We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with a number of predictions no greater than the number of states in its minimal POMDP model.

In predicting or controlling a sequence of observations, the concepts of state and state estimation inevitably arise. There have been two dominant approaches. The generative-model approach, typified by research on partially observable Markov decision processes (POMDPs), hypothesizes a structure for generating observations and estimates its state and state dynamics. The history-based approach, typified by k-order Markov methods, uses simple functions of past observations as state, that is, as the immediate basis for prediction and control. (The data flow in these two approaches is diagrammed in Figure 1.) Of the two, the generative-model approach is more general. The model's internal state gives it temporally unlimited memory (the ability to remember an event that happened arbitrarily long ago), whereas a history-based approach can only remember as far back as its history extends. The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. Most uses of POMDPs, for example, assume a perfect dynamics model and attempt only to estimate state. There are algorithms for simultaneously estimating state and dynamics (e.g., Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al., 1970), but these are only effective at tuning parameters that are already approximately correct (e.g., Shatkay & Kaelbling, 1997).

Figure 1: Data flow in (a) POMDP and other recursive updating of state representation, and (b) history-based state representation. In both, the inputs are observations (and actions); in (b) the state representation is formed from 1-step delays of that input stream.

In practice, history-based approaches are often much more effective. Here, the state representation is a relatively simple record of the stream of past actions and observations. It might record the occurrence of a specific subsequence or that one event has occurred more recently than another. Such representations are far more closely linked to the data than are POMDP representations. One way of saying this is that POMDP learning algorithms encounter many local minima and saddle points because all their states are equipotential. History-based systems immediately break symmetry, and their direct learning procedure makes them comparatively simple. McCallum (1995) has shown in a number of examples that sophisticated history-based methods can be effective in large problems, and are often more practical than POMDP methods even in small ones.
The predictive state representation (PSR) approach, which we develop in this paper, is like the generative-model approach in that it updates the state representation recursively, as in Figure 1(a), rather than directly computing it from data. We show that this enables it to attain generality and compactness at least equal to that of the generative-model approach. However, the PSR approach is also like the history-based approach in that its representations are grounded in data. Whereas a history-based representation looks to the past and records what did happen, a PSR looks to the future and represents what will happen. In particular, a PSR is a vector of predictions for a specially selected set of action-observation sequences, called tests (after Rivest & Schapire, 1994). For example, consider the test u1o1u2o2, where u1 and u2 are specific actions and o1 and o2 are specific observations. The correct prediction for this test given the data stream up to time k is the probability of its observations occurring (in order) given that its actions are taken (in order), i.e., Pr{O_k = o1, O_{k+1} = o2 | A_k = u1, A_{k+1} = u2}. Each test is a kind of experiment that could be performed to tell us something about the system. If we knew the outcome of all possible tests, then we would know everything there is to know about the system. A PSR is a set of tests whose predictions are sufficient information to determine the prediction for all possible tests (a sufficient statistic).

As an example of these points, consider the float/reset problem (Figure 2), consisting of a linear string of 5 states with a distinguished reset state on the far right. One action, f (float), causes the system to move uniformly at random to the right or left by one state, bounded at the two ends. The other action, r (reset), causes a jump to the reset state irrespective of the current state. The observation is always 0 unless the r action is taken when the system is already in the reset state, in which case the observation is 1. Thus, on an f action, the correct prediction is always 0, whereas on an r action, the correct prediction depends on how many fs there have been since the last r: for zero fs, it is 1; for one or two fs, it is 0.5; for three or four fs, it is 0.375; for five or six fs, it is 0.3125; and so on, decreasing after every second f and asymptotically bottoming out at 0.2. No k-order Markov method can model this system exactly, because no limited-length history is a sufficient statistic.

Figure 2: Underlying dynamics of the float/reset problem for (a) the float action and (b) the reset action. The numbers on the arcs indicate transition probabilities (0.5 for each float transition). The observation is always 0 except on the reset action from the rightmost state, which produces an observation of 1.

A POMDP approach can model it exactly by maintaining a belief-state representation over five or so states. A PSR, on the other hand, can exactly model the float/reset system using just two tests: r1 and f0r1. Starting from the rightmost state, the correct predictions for these two tests are always two successive probabilities in the sequence given above (1, 0.5, 0.5, 0.375, ...), which is always a sufficient statistic to predict the next pair in the sequence. Although this informational analysis indicates a solution is possible in principle, it would require a nonlinear updating process for the PSR.
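To make these numbers concrete, the following is a minimal numerical sketch (ours, not code from the paper) that tracks the predictions of the two tests r1 and f0r1 as floats accumulate after a reset. It assumes the float action self-loops with probability 0.5 at the two end states, which is our reading of "bounded at the two ends"; the state indexing and all names are ours.

    import numpy as np

    # Float/reset dynamics, with states 0..4 and state 4 the reset state.
    n = 5
    T_f = np.zeros((n, n))            # float: step left/right with prob. 0.5 each,
    for i in range(n):                # self-looping at the two boundary states
        T_f[i, max(i - 1, 0)] += 0.5
        T_f[i, min(i + 1, n - 1)] += 0.5

    b = np.zeros(n)                   # belief immediately after a reset
    b[n - 1] = 1.0                    # (the reset action simply restores this belief)

    # P(r1 | b): take r and see 1, i.e., the probability of being in the reset state now.
    # P(f0r1 | b): take f (always observes 0), then r and see 1, i.e., (b T_f) at the reset state.
    for k in range(7):
        p_r1 = b[n - 1]
        p_f0r1 = (b @ T_f)[n - 1]
        print(f"after {k} floats: P(r1) = {p_r1:.4f}   P(f0r1) = {p_f0r1:.4f}")
        b = b @ T_f                   # one more float

The printed pairs are (1, 0.5), (0.5, 0.5), (0.5, 0.375), (0.375, 0.375), and so on: successive entries of the sequence quoted above, consistent with the claim that these two predictions form a sufficient statistic for this system.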
In this paper we restrict consideration to a linear special case of PSRs, for which we can guarantee that the number of tests needed does not exceed the number of states in the minimal POMDP representation (although we have not ruled out the possibility that it can be considerably smaller). Of greater ultimate interest are the prospects for learning PSRs and their update functions, about which we can only speculate at this time. The difficulty of learning POMDP structures without good prior models is well known. To the extent that this difficulty is due to the indirect link between the POMDP states and the data, predictive representations may be able to do better.

Jaeger (2000) introduced the idea of predictive representations as an alternative to belief states in hidden Markov models and provided a learning procedure for these models. We build on his work by treating the control case (with actions), which he did not significantly analyze. We have also been strongly influenced by the work of Rivest and Schapire (1994), who did consider tests including actions, but treated only the deterministic case, which is significantly different. They also explored construction and learning algorithms for discovering system structure.

1 Predictive State Representations

We consider dynamical systems that accept actions from a discrete set A and generate observations from a discrete set O. We consider only predicting the system, not controlling it, so we do not designate an explicit reward observation. We refer to such a system as an environment. We use the term history to denote a test forming an initial stream of experience, and we characterize an environment by a probability distribution over all possible histories, P : {O|A}* -> [0,1], where P(o1 ··· oℓ | a1 ··· aℓ) is the probability of observations o1, ..., oℓ being generated, in that order, given that actions a1, ..., aℓ are taken, in that order. The probability of a test t conditional on a history h is defined as P(t|h) = P(ht)/P(h). Given a set of q tests Q = {t_i}, we define their (1 x q) prediction vector,

    p(h) = [P(t_1|h), P(t_2|h), ..., P(t_q|h)],

as a predictive state representation (PSR) if and only if it forms a sufficient statistic for the environment, i.e., if and only if

    P(t|h) = f_t(p(h))    (1)

for any test t and history h, and for some projection function f_t : [0,1]^q -> [0,1]. In this paper we focus on linear PSRs, for which the projection functions are linear, that is, for which there exists a (1 x q) projection vector m_t, for every test t, such that

    P(t|h) = f_t(p(h)) = p(h) m_t^T    (2)

for all histories h. Let p_i(h) denote the ith component of the prediction vector for some PSR. This can be updated recursively, given a new action-observation pair a,o, by

    p_i(hao) = P(t_i|hao) = P(o t_i | ha) / P(o | ha) = f_{aot_i}(p(h)) / f_{ao}(p(h)) = p(h) m_{aot_i}^T / p(h) m_{ao}^T,    (3)

where the last step is specific to linear PSRs. We can now state our main result:

Theorem 1 For any environment that can be represented by a finite POMDP model, there exists a linear PSR with number of tests no larger than the number of states in the minimal POMDP model.
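As an aside before turning to the construction behind Theorem 1, here is a minimal sketch (ours) of the recursive update in Equations (2)-(3), assuming we already hold, for each action-observation pair (a, o), the projection vector m_ao of the one-step test ao and a matrix M_ao collecting the projection vectors of the one-step extensions aot_1, ..., aot_q of the core tests. To keep the update a plain row-vector product, M_ao is stored here with m_{aot_j}^T as its jth column; the function and variable names are ours.

    import numpy as np

    def psr_step(p, M_ao, m_ao):
        """One recursive update of a linear PSR (Equations (2)-(3)).

        p    : (q,) prediction vector p(h) for the core tests t_1, ..., t_q
        M_ao : (q, q) matrix whose j-th column is m_{aot_j}^T, so that
               p @ M_ao is the vector of numerators p(h) m_{aot_j}^T
        m_ao : (q,) projection vector of the one-step test ao

        Returns (p(hao), P(o | h, a)).
        """
        p_o = float(p @ m_ao)           # Equation (2): P(o | h, a) = p(h) m_ao^T
        if p_o == 0.0:
            raise ValueError("observation o has zero probability after action a")
        return (p @ M_ao) / p_o, p_o    # Equation (3), one component per core test

Chaining psr_step along an action-observation sequence keeps the prediction vector current, and the product of the returned P(o | h, a) values is the probability of that sequence; the construction below shows how such parameters can be obtained from a POMDP.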
2 Proof of Theorem 1: Constructing a PSR from a POMDP

We prove Theorem 1 by showing that for any POMDP model of the environment, we can construct in polynomial time a linear PSR for that POMDP of lesser or equal complexity that produces the same probability distribution over histories as the POMDP model. We proceed in three steps. First, we review POMDP models and how they assign probabilities to tests. Next, we define an algorithm that takes an n-state POMDP model and produces a set of n or fewer tests, each of length less than or equal to n. Finally, we show that the set of tests constitutes a PSR for the POMDP, that is, that there are projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP.

A POMDP (Lovejoy, 1991; Kaelbling et al., 1998) is defined by a sextuple (S, A, O, b_0, T, O). Here, S is a set of n underlying (hidden) states, A is a discrete set of actions, and O is a discrete set of observations. The (1 x n) vector b_0 is an initial state distribution. The set T consists of (n x n) transition matrices T^a, one for each action a, where T^a_{ij} is the probability of a transition from state i to j when action a is chosen. The set O consists of diagonal (n x n) observation matrices O^{a,o}, one for each pair of observation o and action a, where O^{a,o}_{ii} is the probability of observation o when action a is selected and state i is reached. (Footnote 1: There are many equivalent formulations, and the conversion procedure described here can be easily modified to accommodate other POMDP definitions.)

The state representation in a POMDP (Figure 1(a)) is the belief state: the (1 x n) vector b(h) of state-occupation probabilities given the history h. It can be computed recursively, given a new action a and observation o, by

    b(hao) = b(h) T^a O^{a,o} / (b(h) T^a O^{a,o} e_n^T),

where e_n is the (1 x n) vector of all 1s. Finally, a POMDP defines a probability distribution over tests (and thus histories) by

    P(o1 ··· oℓ | h a1 ··· aℓ) = b(h) T^{a1} O^{a1,o1} ··· T^{aℓ} O^{aℓ,oℓ} e_n^T.    (4)

We now present our algorithm for constructing a PSR for a given POMDP. It uses a function u mapping tests to (1 x n) vectors, defined recursively by u(ε) = e_n and u(aot) = (T^a O^{a,o} u(t)^T)^T, where ε represents the null test. Conceptually, the components of u(t) are the probabilities of the test t when applied from each underlying state of the POMDP; we call u(t) the outcome vector for test t. We say a test t is linearly independent of a set of tests S if its outcome vector is linearly independent of the set of outcome vectors of the tests in S. Our algorithm, search, is invoked as Q <- search(ε, {}) and defined as

    search(t, S):
        for each a in A, o in O:
            if aot is linearly independent of S
                then S <- search(aot, S ∪ {aot})
        return S

The algorithm maintains a set of tests and searches for new tests that are linearly independent of those already found. It is a form of depth-first search. The algorithm halts when it checks all the one-step extensions of its tests and finds none that are linearly independent. Because the tests in the set Q returned by search have linearly independent outcome vectors, the cardinality of Q is bounded by n, ensuring that the algorithm halts after a polynomial number of iterations. Because each test in Q is formed by a one-step extension to some other test in Q, no test is longer than n action-observation pairs. The check for linear independence can be performed in many ways, including Gaussian elimination, implying that search terminates in polynomial time.
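The following is a small Python sketch of search (our rendering, not code from the paper). It assumes the POMDP is supplied as a dictionary T of (n x n) transition matrices indexed by action and a dictionary O of diagonal (n x n) observation matrices indexed by (action, observation), and it checks linear independence with a numerical rank test instead of Gaussian elimination; all names and tolerances are ours.

    import numpy as np
    from itertools import product

    def outcome(test, T, O, n):
        """Outcome vector u(t): u(eps) = e_n and u(aot) = (T^a O^{a,o} u(t)^T)^T.
        A test is a tuple of (action, observation) pairs."""
        u = np.ones(n)
        for a, o in reversed(test):          # apply the innermost suffix first
            u = T[a] @ O[(a, o)] @ u
        return u

    def search(T, O, n, actions, observations):
        """Depth-first search for a set Q of tests with linearly independent
        outcome vectors, following the procedure described above."""
        Q, U_cols = [], []                   # tests found so far and their outcome vectors

        def lin_indep(u):
            cols = U_cols + [u]
            return np.linalg.matrix_rank(np.column_stack(cols), tol=1e-10) == len(cols)

        def expand(t):
            for a, o in product(actions, observations):
                aot = ((a, o),) + t          # one-step extension a o t
                u = outcome(aot, T, O, n)
                if lin_indep(u):
                    Q.append(aot)
                    U_cols.append(u)
                    expand(aot)

        expand(())                           # start from the null test
        return Q, np.column_stack(U_cols)    # U is (n x q); its columns are outcome vectors

The returned matrix of outcome vectors plays the role of the matrix U in the construction that follows.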
By construction, all one-step extensions to the set of tests Q returned by search are linearly dependent on those in Q. We now show that this is true for any test.

Lemma 1 The outcome vectors of the tests in Q can be linearly combined to produce the outcome vector for any test.

Proof: Let U be the (n x q) matrix formed by concatenating the outcome vectors for all tests in Q. Since, for all combinations of a and o, the columns of T^a O^{a,o} U are linearly dependent on the columns of U, we can write T^a O^{a,o} U = U W^T for some (q x q) matrix of weights W. If t is a test that is linearly dependent on Q, then any one-step extension of t, aot, is linearly dependent on Q. This is because we can write the outcome vector for t as u(t) = (U w^T)^T for some (1 x q) weight vector w, and the outcome vector for aot as u(aot) = (T^a O^{a,o} u(t)^T)^T = (T^a O^{a,o} U w^T)^T = (U W^T w^T)^T. Thus, aot is linearly dependent on Q. Now, note that all one-step tests are linearly dependent on Q by the structure of the search algorithm. Using the previous paragraph as an inductive argument, this implies that all tests are linearly dependent on Q. □

Returning to the float/reset example POMDP, search begins by enumerating the 4 extensions to the null test (f0, f1, r0, and r1). Of these, only f0 and r0 are linearly independent. Of the extensions of these, f0r0 is the only one that is linearly independent of the other two. The remaining two tests added to Q by search are f0f0r0 and f0f0f0r0. No extensions of the 5 tests in Q are linearly independent of the 5 tests in Q, so the procedure halts.

We now show that the set of tests Q constitutes a PSR for the POMDP by constructing projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. For each combination of a and o, define a (q x q) matrix M_{ao} = (U^+ T^a O^{a,o} U)^T and a (1 x q) vector m_{ao} = (U^+ T^a O^{a,o} e_n^T)^T, where U is the matrix of outcome vectors defined in the previous section and U^+ is its pseudoinverse. (Footnote 2: If U = A Σ B^T is the singular value decomposition of U, then B Σ^+ A^T is the pseudoinverse; the pseudoinverse Σ^+ of the diagonal matrix Σ replaces each non-zero element with its reciprocal.) The ith row of M_{ao} is m_{aot_i}. The probability distribution on histories implied by these projection vectors is

    P(o1 ··· oℓ | h a1 ··· aℓ) = p(h) M_{a1 o1} ··· M_{a_{ℓ-1} o_{ℓ-1}} m_{a_ℓ o_ℓ}^T
        = b(h) U U^+ T^{a1} O^{a1,o1} U ··· U^+ T^{a_{ℓ-1}} O^{a_{ℓ-1},o_{ℓ-1}} U U^+ T^{a_ℓ} O^{a_ℓ,o_ℓ} e_n^T
        = b(h) T^{a1} O^{a1,o1} ··· T^{a_{ℓ-1}} O^{a_{ℓ-1},o_{ℓ-1}} T^{a_ℓ} O^{a_ℓ,o_ℓ} e_n^T,

i.e., it is the same as that of the POMDP, as in Equation 4. Here, the last step uses the fact that U U^+ v^T = v^T for any v^T linearly dependent on the columns of U. This holds by construction of U in the previous section. This completes the proof of Theorem 1.

Completing the float/reset example, consider the M_{f,0} matrix found by the process defined in this section. It derives predictions for each test in Q after taking action f. Most of these are quite simple because the tests are so similar: the new prediction for r0 is exactly the old prediction for f0r0, for example. The only nontrivial test is f0f0f0r0; its new prediction is computed as 0.250 p(r0|h) - 0.0625 p(f0r0|h) + 0.750 p(f0f0r0|h). This example illustrates that the projection vectors need not contain only positive entries.
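As a companion to this construction, here is a minimal sketch (ours) that builds the projection parameters from the outcome matrix U returned by the search sketch above and compares the resulting history probabilities with Equation (4). For convenience the M matrices are stored transposed relative to the text, so that the row vector p(h) multiplies them on the left; all names are ours.

    import numpy as np

    def build_psr_params(U, T, O, actions, observations):
        """For each (a, o): M[(a, o)] = U^+ T^a O^{a,o} U (the transpose of the
        text's M_ao) and m[(a, o)] = U^+ T^a O^{a,o} e_n^T."""
        n = U.shape[0]
        Uplus = np.linalg.pinv(U)
        M, m = {}, {}
        for a in actions:
            for o in observations:
                M[(a, o)] = Uplus @ T[a] @ O[(a, o)] @ U
                m[(a, o)] = Uplus @ T[a] @ O[(a, o)] @ np.ones(n)
        return M, m

    def prob_pomdp(b, test, T, O):
        """Equation (4): the probability of a test computed from the POMDP belief b(h)."""
        v = b.copy()
        for a, o in test:
            v = v @ T[a] @ O[(a, o)]
        return float(v.sum())

    def prob_psr(p, test, M, m):
        """The same probability computed from the prediction vector p(h) = b(h) U,
        i.e., p(h) M_{a1 o1} ... m_{a_l o_l}^T in the notation above."""
        v = p.copy()
        for a, o in test[:-1]:
            v = v @ M[(a, o)]
        a, o = test[-1]
        return float(v @ m[(a, o)])

    # Hypothetical usage, given Q, U from the search sketch and an initial belief b0:
    #   M, m = build_psr_params(U, T, O, actions, observations)
    #   p0 = b0 @ U                      # initial prediction vector
    # For any test t, prob_pomdp(b0, t, T, O) and prob_psr(p0, t, M, m) agree,
    # which is exactly the equality derived above.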
3 Conclusion

We have introduced a predictive state representation for dynamical systems that is grounded in actions and observations and shown that, even in its linear form, it is at least as general and compact as POMDPs. In essence, we have established PSRs as a non-inferior alternative to POMDPs, and suggested that they might have important advantages, while leaving demonstration of those advantages to future work. We conclude by summarizing the potential advantages (to be explored in future work):

Learnability. The k-order Markov model is similar to PSRs in that it is entirely based on actions and observations. Such models can be learned trivially from data by counting; it is an open question whether something similar can be done with a PSR. Jaeger (2000) showed how to learn such a model in the uncontrolled setting, but the situation is more complex in the multiple-action case since outcomes are conditioned on behavior, violating some required independence assumptions.

Compactness. We have shown that there exist linear PSRs no more complex than the minimal POMDP for an environment, but in some cases the minimal linear PSR seems to be much smaller. For example, a POMDP extension of factored MDPs explored by Singh and Cohn (1998) would be cross-products of separate POMDPs and have linear PSRs that increase linearly with the number and size of the component POMDPs, whereas their minimal POMDP representation would grow as the size of the state space, i.e., exponentially in the number of component POMDPs. This (apparent) advantage stems from the PSR's combinatorial or factored structure. As a vector of state variables, capable of taking on diverse values, a PSR may be inherently more powerful than the distribution over discrete states (the belief state) of a POMDP. We have already seen that general PSRs can be more compact than POMDPs; they are also capable of efficiently capturing environments in the diversity representation used by Rivest and Schapire (1994), which is known to provide an extremely compact representation for some environments.

Generalization. There are reasons to think that state variables that are themselves predictions may be particularly useful in learning to make other predictions. With so many things to predict, we have in effect a set or sequence of learning problems, all due to the same environment. In many such cases the solutions to earlier problems have been shown to provide features that generalize particularly well to subsequent problems (e.g., Baxter, 2000; Thrun & Pratt, 1998).

Powerful, extensible representations. PSRs that predict tests could be generalized to predict the outcomes of multi-step options (e.g., Sutton et al., 1999). In this case, particularly, they would constitute a powerful language for representing the state of complex environments.

Acknowledgments: We thank Peter Dayan, Lawrence Saul, Fernando Pereira and Rob Schapire for many helpful discussions of these and related ideas.

References

Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164-171.

Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149-198.

Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 183-188). San Jose, California: AAAI Press.

Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12, 1371-1398.

Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99-134.

Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47-65.

McCallum, A. K. (1995). Reinforcement learning with selective perception and hidden state. Doctoral dissertation, Department of Computer Science, University of Rochester.
Rivest, R. L., & Schapire, R. E. (1994). Diversity-based inference of finite automata. Journal of the ACM, 41, 555-589.

Shatkay, H., & Kaelbling, L. P. (1997). Learning topological maps with weak local odometric information. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97) (pp. 920-929).

Singh, S., & Cohn, D. (1998). How to dynamically merge Markov decision processes. Advances in Neural Information Processing Systems 10 (pp. 1057-1063).

Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181-211.

Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. [sent-7, score-0.176]
2 State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. [sent-8, score-0.319]
3 Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. [sent-9, score-0.299]
4 Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). [sent-10, score-0.265]
5 We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model. [sent-11, score-0.493]
6 In predicting or controlling a sequence of observations, the concepts of state and state estimation inevitably arise. [sent-12, score-0.286]
7 The generative-model approach, typified by research on partially observable Markov decision processes (POMDPs), hypothesizes a structure for generating observations and estimates its state and state dynamics. [sent-14, score-0.458]
8 The history-based approach, typified by k-order Markov methods, uses simple functions of past observations as state, that is, as the immediate basis for prediction and control. [sent-15, score-0.181]
9 (The data flow in these two approaches are diagrammed in Figure 1. [sent-16, score-0.035]
10 The model's internal state gives it temporally unlimited memorythe ability to remember an event that happened arbitrarily long ago--whereas a history-based approach can only remember as far back as its history extends. [sent-18, score-0.268]
11 The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. [sent-19, score-0.063]
12 There are algorithms for simultaneously estimating state and dynamics (e. [sent-21, score-0.143]
13 , Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al. [sent-23, score-0.041]
14 - (a) state rep'n observations (and actions) ¢E / t/' --+ 1-step delays . [sent-29, score-0.2]
15 state rep'n (b) Figure 1: Data flow in a) POMDP and other recursive updating of state representation, and b) history-based state representation. [sent-30, score-0.464]
16 Here, the state representation is a relatively simple record of the stream of past actions and observations. [sent-32, score-0.423]
17 It might record the occurrence of a specific subsequence or that one event has occurred more recently than another. [sent-33, score-0.067]
18 Such representations are far more closely linked to the data than are POMDP representations. [sent-34, score-0.057]
19 One way of saying this is that POMDP learning algorithms encounter many local minima and saddle points because all their states are equipotential. [sent-35, score-0.057]
20 The predictive state representation (PSR) approach, which we develop in this paper, is like the generative-model approach in that it updates the state representation recursively, as in Figure l(a), rather than directly computing it from data. [sent-38, score-0.487]
21 However, the PSR approach is also like the history-based approach in that its representations are grounded in data. [sent-40, score-0.113]
22 Whereas a history-based representation looks to the past and records what did happen, a PSR looks to the future and represents what will happen. [sent-41, score-0.132]
23 In particular, a PSR is a vector of predictions for a specially selected set of action-observation sequences, called tests (after Rivest & Schapire, 1994). [sent-42, score-0.419]
24 For example, consider the test U101U202, where U1 and U2 are specific actions and 01 and 02 are specific observations. [sent-43, score-0.258]
25 The correct prediction for this test given the data stream up to time k is the probability of its observations occurring (in order) given that its actions are taken (in order) (i. [sent-44, score-0.34]
26 Each test is a kind of experiment that could be performed to tell us something about the system. [sent-47, score-0.083]
27 If we knew the outcome of all possible tests, then we would know everything there is to know about the system. [sent-48, score-0.188]
28 A PSR is a set of tests that is sufficient information to determine the prediction for all possible tests (a sufficient statistic). [sent-49, score-0.808]
29 As an example of these points, consider the float/reset problem (Figure 2) consisting of a linear string of 5 states with a distinguished reset state on the far right. [sent-50, score-0.316]
30 The other action, r (reset), causes a jump to the reset state irrespective of the current state. [sent-52, score-0.259]
31 The observation is always o unless the r action is taken when the system is already in the reset state, in which case the observation is 1. [sent-53, score-0.357]
32 Thus, on an f action, the correct prediction is always 0, whereas on an r action, the correct prediction depends on how many fs there have been since the last r: for zero fS, it is 1; for one or two fS, it is 0. [sent-54, score-0.195]
33 5 a) float action 1,0=1 b) reset action Figure 2: Underlying dynamics of the float/reset problem for a) the float action and b) the reset action. [sent-62, score-0.752]
34 The observation is always 0 except on the reset action from the rightmost state, which produces an observation of 1. [sent-64, score-0.361]
35 A POMDP approach can model it exactly by maintaining a belief-state representation over five or so states. [sent-66, score-0.051]
36 Starting from the rightmost state, the correct predictions for these two tests are always two successive probabilities in the sequence given above (1, 0. [sent-68, score-0.423]
37 ), which is always a sufficient statistic to predict the next pair in the sequence. [sent-74, score-0.114]
38 In this paper we restrict consideration to a linear special case of PSRs, for which we can guarantee that the number of tests needed does not exceed the number of states in the minimal POMDP representation (although we have not ruled out the possibility it can be considerably smaller). [sent-76, score-0.496]
39 To the extent that this difficulty is due to the indirect link between the POMDP states and the data, predictive representations may be able to do better. [sent-79, score-0.239]
40 Jaeger (2000) introduced the idea of predictive representations as an alternative to belief states in hidden Markov models and provided a learning procedure for these models. [sent-80, score-0.241]
41 We have also been strongly influenced by the work of Rivest and Schapire (1994), who did consider tests including actions, but treated only the deterministic case, which is significantly different. [sent-82, score-0.332]
42 They also explored construction and learning algorithms for discovering system structure. [sent-83, score-0.063]
43 1 Predictive State Representations We consider dynamical systems that accept actions from a discrete set A and generate observations from a discrete set O. [sent-84, score-0.271]
44 We use the term history to denote a test forming an initial stream of experience and characterize an environment by a probability distribution over all possible histories, P : {OIA}* H- [0,1], where P(Ol··· Otl a1··· at) is the probability of observations 01, . [sent-87, score-0.271]
45 , O£ being generated, in that order, given that actions aI, . [sent-90, score-0.128]
46 The probability of a test t conditional on a history h is defined as P(tlh) = P(ht)/P(h). [sent-94, score-0.157]
47 Given a set of q tests Q = {til, we define their (1 x q) prediction vector, p(h) = [P(t1Ih),P(t2Ih), . [sent-95, score-0.388]
48 ,P(tqlh)], as a predictive state representation (PSR) if and only if it forms a sufficient statistic for the environment, Le. [sent-98, score-0.379]
49 , if and only if P(tlh) = ft(P(h)), (1) for any test t and history h, and for some projection junction ft : [0, l]q ~ [0,1]. [sent-99, score-0.232]
50 In this paper we focus on linear PSRs, for which the projection functions are linear, that is, for which there exist a (1 x q) projection vector mt, for every test t, such that (2) P(tlh) == ft(P(h)) =7 p(h)mf, for all histories h. [sent-100, score-0.325]
51 Let Pi(h) denote the ith component of the prediction vector for some PSR. [sent-101, score-0.083]
52 lh ) == P(otil ha ) == faati(P(h)) == p(h)m'{;ati P2 ao 2 ao P(olha) faa (P(h)) p(h)mro ' (3) where the last step is specific to linear PSRs. [sent-104, score-0.094]
53 We can now state our main result: Theorem 1 For any environment that can be represented by a finite POMDP model, there exists a linear PSR with number of tests no larger than the number of states in the minimal POMDP model. [sent-105, score-0.634]
54 Finally, we show that the set of tests constitute a PSR for the POMDP, that is, that there are projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. [sent-110, score-0.598]
55 Here, 8 is a set of n underlying (hidden) states, A is a discrete set of actions, and 0 is a discrete set of observations. [sent-113, score-0.054]
56 The (1 x n) vector bo is an initial state distribution. [sent-114, score-0.2]
57 The set T consists of (n x n) transition matrices Ta, one for each action a, where Tlj is the probability of a transition from state i to j when action a is chosen. [sent-115, score-0.395]
58 The set 0 consists of diagonal (n x n) observation matrices oa,o, one for each pair of observation 0 and action a, where o~'o is the probability of observation 0 when action a is selected and state i is reached. [sent-116, score-0.527]
59 l The state representation in a POMDP (Figure l(a)) is the belief state-the (1 x n) vector of the state-occupation probabilities given the history h. [sent-117, score-0.318]
60 It can be computed recursively given a new action a and observation 0 by b(h)Taoa,o b(hao) = b(h)Taoa,oe;' where en is the (1 x n)-vector of all Is. [sent-118, score-0.255]
61 Finally, a POMDP defines a probability distribution over tests (and thus histories) by P(Ol . [sent-119, score-0.332]
62 It uses a function u mapping tests to (1 x n) vectors defined recursively by u(c) == en and u(aot) == (Taoa,ou(t)T)T, where c represents the null test. [sent-131, score-0.516]
63 Conceptually, the components of u(t) are the probabilities of the test t when applied from each underlying state of the POMDP; we call u(t) the outcome vector for test t. [sent-132, score-0.466]
64 We say a test t is linearly independent of a set of tests S if its outcome vector is linearly independent of the set of outcome vectors of the tests in S. [sent-133, score-1.453]
65 The algorithm halts when it checks all the one-step extensions of its tests and finds none that are linearly independent. [sent-136, score-0.569]
66 Because the set of tests Q returned by search have linearly independent outcome vectors, the cardinality of Q is bounded by n, ensuring that the algorithm halts after a polynomial number of iterations. [sent-137, score-0.794]
67 Because each test in Q is formed by a one-step extension to some other test in Q, no test is longer than n action-observation pairs. [sent-138, score-0.162]
68 The check for linear independence can be performed in many ways, including Gaussian elimination, implying that search terminates in polynomial time. [sent-139, score-0.055]
69 By construction, all one-step extensions to the set of tests Q returned by search are linearly dependent on those in Q. [sent-140, score-0.682]
70 Lemma 1 The outcome vectors of the tests in Q can be linearly combined to produce the outcome vector for any test. [sent-142, score-0.919]
71 Proof: Let U be the (n x q) matrix formed by concatenating the outcome vectors for all tests in Q. [sent-143, score-0.556]
72 Since, for all combinations of a and 0, the columns of Taoa,ou are linearly dependent on the columns of U, we can write Taoa,ou == UW T for some q x q matrix of weights W. [sent-144, score-0.211]
73 If t is a test that is linearly dependent on Q, then anyone-step extension of t, aot, is linearly dependent on Q. [sent-145, score-0.476]
74 This is because we can write the outcome vector for t as u(t) == (UwT)T for some (1 x q) weight vector w and the outcome vector for aot as u(aot) == (Taoa,ou(t)T)T == (Taoa,oUwT)T == (UWTwT)T. [sent-146, score-0.647]
75 Now, note that all one-step tests are linearly dependent on Q by the structure of the search algorithm. [sent-148, score-0.598]
76 Using the previous paragraph as an inductive argument, this implies that all tests are linearly dependent on Q. [sent-149, score-0.573]
77 0 Returning to the float/reset example POMDP, search begins with by enumerating the 4 extensions to the null test (fO, fl, rO, and rl). [sent-150, score-0.189]
78 Of these, only fa and rO are are linearly independent. [sent-151, score-0.148]
79 Of the extensions of these, fOrO is the only one that is linearly independent of the other two. [sent-152, score-0.199]
80 The remaining two tests added to Q by search are fOfOrO and fOfOfOrO. [sent-153, score-0.387]
81 No extensions of the 5 tests in Q are linearly independent of the 5 tests in Q, so the procedure halts. [sent-154, score-0.863]
82 We now show that the set of tests Q constitute a PSR for the POMDP by constructing projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. [sent-155, score-0.625]
83 For each combination of a and 0, define a q x q matrix Mao == (U+Taoa,ou)T and a 1 x q vector mao == (U+Taoa,oe;;J T , where U is the matrix of outcome vectors defined in the previous section and U+ is its pseudoinverse2 • The ith row of Mao is maoti. [sent-156, score-0.356]
84 The probability distribution on histories implied by these projection vectors is p(h )m~101 alOl p(h)M~ol M~_10l_1 m~Ol b(h)UU+r a1 oa 1,01 U . [sent-157, score-0.224]
85 Here, the last step uses the fact that UU+v T == v T for v T linearly dependent on the columns of U. [sent-165, score-0.211]
86 It derives predictions for each test in Q after taking action f. [sent-169, score-0.24]
87 Most of these are quite simple because the tests are so similar: the new prediction for rO is exactly the old prediction for fOrO, for example. [sent-170, score-0.444]
88 This example illustrates that the projection vectors need not contain only positive entries. [sent-176, score-0.092]
89 3 Conclusion We have introduced a predictive state representation for dynamical systems that is grounded in actions and observations and shown that, even in its linear form, it is at least as general and compact as POMDPs. [sent-177, score-0.598]
90 We conclude by summarizing the potential advantages (to be explored in future work): Learnability. [sent-179, score-0.063]
91 The k-order Markov model is similar to PSRs in that it is entirely based on actions and observations. [sent-180, score-0.128]
92 Jaeger (2000) showed how to learn such a model in the uncontrolled setting, but the situation is more complex in the multiple action case since outcomes are conditioned on behavior, violating some required independence assumptions. [sent-182, score-0.208]
93 We have shown that there exist linear PSRs no more complex that the minimal POMDP for an environment, but in some cases the minimal linear PSR seems to be much smaller. [sent-184, score-0.112]
94 As a vector of state variables, capable of taking on diverse values, a PSR may be inherently more powerful than the distribution over discrete states (the belief state) of a POMDP. [sent-190, score-0.282]
95 We have already seen that general PSRs can be more compact than POMDPs; they are also capable of efficiently capturing environments in the diversity representation used by Rivest and Schapire (1994), which is known to provide an extremely compact representation for some environments. [sent-191, score-0.166]
96 There are reasons to think that state variables that are themselves predictions may be particularly useful in learning to make other predictions. [sent-193, score-0.23]
97 PSRs that predict tests could be generalized to predict the outcomes of multi-step options (e. [sent-199, score-0.429]
98 In this case, particularly, they would constitute a powerful language for representing the state of complex environments. [sent-203, score-0.185]
99 Planning and acting in ' partially observable stochastic domains. [sent-234, score-0.074]
100 A survey of algorithmic methods for partially observable Markov decision processes. [sent-239, score-0.074]
wordName wordTfidf (topN-words)
[('pomdp', 0.471), ('psr', 0.38), ('tests', 0.332), ('psrs', 0.214), ('aot', 0.19), ('outcome', 0.188), ('linearly', 0.148), ('pomdps', 0.144), ('state', 0.143), ('histories', 0.132), ('actions', 0.128), ('action', 0.126), ('rivest', 0.119), ('reset', 0.116), ('predictive', 0.099), ('jaeger', 0.095), ('fs', 0.083), ('schapire', 0.071), ('float', 0.071), ('mao', 0.071), ('ro', 0.071), ('tlh', 0.071), ('kaelbling', 0.07), ('history', 0.069), ('ol', 0.066), ('dependent', 0.063), ('predictions', 0.06), ('recursively', 0.058), ('representations', 0.057), ('observations', 0.057), ('states', 0.057), ('grounded', 0.056), ('prediction', 0.056), ('projection', 0.056), ('minimal', 0.056), ('search', 0.055), ('test', 0.054), ('singh', 0.054), ('ft', 0.053), ('extensions', 0.051), ('representation', 0.051), ('uu', 0.05), ('markov', 0.049), ('observable', 0.048), ('chrisman', 0.047), ('foro', 0.047), ('lovejoy', 0.047), ('shatkay', 0.047), ('environment', 0.046), ('stream', 0.045), ('observation', 0.044), ('sufficient', 0.044), ('constitute', 0.042), ('statistic', 0.042), ('uncontrolled', 0.041), ('outcomes', 0.041), ('pratt', 0.041), ('typified', 0.041), ('artificial', 0.041), ('specific', 0.038), ('halts', 0.038), ('baum', 0.038), ('vectors', 0.036), ('explored', 0.036), ('flow', 0.035), ('ok', 0.035), ('defined', 0.034), ('returned', 0.033), ('dynamical', 0.032), ('sutton', 0.032), ('compact', 0.032), ('rightmost', 0.031), ('reinforcement', 0.03), ('bo', 0.03), ('inductive', 0.03), ('intelligence', 0.03), ('something', 0.029), ('record', 0.029), ('mccallum', 0.029), ('null', 0.029), ('thrun', 0.029), ('belief', 0.028), ('predict', 0.028), ('ao', 0.028), ('remember', 0.028), ('factored', 0.028), ('particularly', 0.027), ('past', 0.027), ('system', 0.027), ('constructing', 0.027), ('advantages', 0.027), ('en', 0.027), ('baxter', 0.027), ('littman', 0.027), ('looks', 0.027), ('discrete', 0.027), ('vector', 0.027), ('difficulty', 0.026), ('mdps', 0.026), ('partially', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 148 nips-2001-Predictive Representations of State
Author: Michael L. Littman, Richard S. Sutton
Abstract: We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model. In predicting or controlling a sequence of observations, the concepts of state and state estimation inevitably arise. There have been two dominant approaches. The generative-model approach, typified by research on partially observable Markov decision processes (POMDPs), hypothesizes a structure for generating observations and estimates its state and state dynamics. The history-based approach, typified by k-order Markov methods, uses simple functions of past observations as state, that is, as the immediate basis for prediction and control. (The data flow in these two approaches are diagrammed in Figure 1.) Of the two, the generative-model approach is more general. The model's internal state gives it temporally unlimited memorythe ability to remember an event that happened arbitrarily long ago--whereas a history-based approach can only remember as far back as its history extends. The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. Most uses of POMDPs, for example, assume a perfect dynamics model and attempt only to estimate state. There are algorithms for simultaneously estimating state and dynamics (e.g., Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al., 1970), but these are only effective at tuning parameters that are already approximately correct (e.g., Shatkay & Kaelbling, 1997). observations (and actions) 1-----1-----1..- (a) state rep'n observations (and actions) ¢E / t/' --+ 1-step delays . state rep'n (b) Figure 1: Data flow in a) POMDP and other recursive updating of state representation, and b) history-based state representation. In practice, history-based approaches are often much more effective. Here, the state representation is a relatively simple record of the stream of past actions and observations. It might record the occurrence of a specific subsequence or that one event has occurred more recently than another. Such representations are far more closely linked to the data than are POMDP representations. One way of saying this is that POMDP learning algorithms encounter many local minima and saddle points because all their states are equipotential. History-based systems immediately break symmetry, and their direct learning procedure makes them comparably simple. McCallum (1995) has shown in a number of examples that sophisticated history-based methods can be effective in large problems, and are often more practical than POMDP methods even in small ones. 
The predictive state representation (PSR) approach, which we develop in this paper, is like the generative-model approach in that it updates the state representation recursively, as in Figure l(a), rather than directly computing it from data. We show that this enables it to attain generality and compactness at least equal to that of the generative-model approach. However, the PSR approach is also like the history-based approach in that its representations are grounded in data. Whereas a history-based representation looks to the past and records what did happen, a PSR looks to the future and represents what will happen. In particular, a PSR is a vector of predictions for a specially selected set of action-observation sequences, called tests (after Rivest & Schapire, 1994). For example, consider the test U101U202, where U1 and U2 are specific actions and 01 and 02 are specific observations. The correct prediction for this test given the data stream up to time k is the probability of its observations occurring (in order) given that its actions are taken (in order) (i.e., Pr {Ok = 01, Ok+1 = 02 I A k = u1,A k + 1 = U2}). Each test is a kind of experiment that could be performed to tell us something about the system. If we knew the outcome of all possible tests, then we would know everything there is to know about the system. A PSR is a set of tests that is sufficient information to determine the prediction for all possible tests (a sufficient statistic). As an example of these points, consider the float/reset problem (Figure 2) consisting of a linear string of 5 states with a distinguished reset state on the far right. One action, f (float), causes the system to move uniformly at random to the right or left by one state, bounded at the two ends. The other action, r (reset), causes a jump to the reset state irrespective of the current state. The observation is always o unless the r action is taken when the system is already in the reset state, in which case the observation is 1. Thus, on an f action, the correct prediction is always 0, whereas on an r action, the correct prediction depends on how many fs there have been since the last r: for zero fS, it is 1; for one or two fS, it is 0.5; for three or four fS, it is 0.375; for five or six fs, it is 0.3125, and so on decreasing after every second f, asymptotically bottoming out at 0.2. No k-order Markov method can model this system exactly, because no limited-. .5 .5 a) float action 1,0=1 b) reset action Figure 2: Underlying dynamics of the float/reset problem for a) the float action and b) the reset action. The numbers on the arcs indicate transition probabilities. The observation is always 0 except on the reset action from the rightmost state, which produces an observation of 1. length history is a sufficient statistic. A POMDP approach can model it exactly by maintaining a belief-state representation over five or so states. A PSR, on the other hand, can exactly model the float/reset system using just two tests: rl and fOrI. Starting from the rightmost state, the correct predictions for these two tests are always two successive probabilities in the sequence given above (1, 0.5, 0.5, 0.375,...), which is always a sufficient statistic to predict the next pair in the sequence. Although this informational analysis indicates a solution is possible in principle, it would require a nonlinear updating process for the PSR. 
In this paper we restrict consideration to a linear special case of PSRs, for which we can guarantee that the number of tests needed does not exceed the number of states in the minimal POMDP representation (although we have not ruled out the possibility it can be considerably smaller). Of greater ultimate interest are the prospects for learning PSRs and their update functions, about which we can only speculate at this time. The difficulty of learning POMDP structures without good prior models are well known. To the extent that this difficulty is due to the indirect link between the POMDP states and the data, predictive representations may be able to do better. Jaeger (2000) introduced the idea of predictive representations as an alternative to belief states in hidden Markov models and provided a learning procedure for these models. We build on his work by treating the control case (with actions), which he did not significantly analyze. We have also been strongly influenced by the work of Rivest and Schapire (1994), who did consider tests including actions, but treated only the deterministic case, which is significantly different. They also explored construction and learning algorithms for discovering system structure. 1 Predictive State Representations We consider dynamical systems that accept actions from a discrete set A and generate observations from a discrete set O. We consider only predicting the system, not controlling it, so we do not designate an explicit reward observation. We refer to such a system as an environment. We use the term history to denote a test forming an initial stream of experience and characterize an environment by a probability distribution over all possible histories, P : {OIA}* H- [0,1], where P(Ol··· Otl a1··· at) is the probability of observations 01, ... , O£ being generated, in that order, given that actions aI, ... ,at are taken, in that order. The probability of a test t conditional on a history h is defined as P(tlh) = P(ht)/P(h). Given a set of q tests Q = {til, we define their (1 x q) prediction vector, p(h) = [P(t1Ih),P(t2Ih), ... ,P(tqlh)], as a predictive state representation (PSR) if and only if it forms a sufficient statistic for the environment, Le., if and only if P(tlh) = ft(P(h)), (1) for any test t and history h, and for some projection junction ft : [0, l]q ~ [0,1]. In this paper we focus on linear PSRs, for which the projection functions are linear, that is, for which there exist a (1 x q) projection vector mt, for every test t, such that (2) P(tlh) == ft(P(h)) =7 p(h)mf, for all histories h. Let Pi(h) denote the ith component of the prediction vector for some PSR. This can be updated recursively, given a new action-observation pair a,o, by .(h ) == P(t.lh ) == P(otil ha ) == faati(P(h)) == p(h)m'{;ati P2 ao 2 ao P(olha) faa (P(h)) p(h)mro ' (3) where the last step is specific to linear PSRs. We can now state our main result: Theorem 1 For any environment that can be represented by a finite POMDP model, there exists a linear PSR with number of tests no larger than the number of states in the minimal POMDP model. 2 Proof of Theorem 1: Constructing a PSR from a POMDP We prove Theorem 1 by showing that for any POMDP model of the environment, we can construct in polynomial time a linear PSR for that POMDP of lesser or equal complexity that produces the same probability distribution over histories as the POMDP model. We proceed in three steps. First, we review POMDP models and how they assign probabilities to tests. 
Next, we define an algorithm that takes an n-state POMDP model and produces a set of n or fewer tests, each of length less than or equal to n. Finally, we show that the set of tests constitute a PSR for the POMDP, that is, that there are projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. A POMDP (Lovejoy, 1991; Kaelbling et al., 1998) is defined by a sextuple (8, A, 0, bo, T, 0). Here, 8 is a set of n underlying (hidden) states, A is a discrete set of actions, and 0 is a discrete set of observations. The (1 x n) vector bo is an initial state distribution. The set T consists of (n x n) transition matrices Ta, one for each action a, where Tlj is the probability of a transition from state i to j when action a is chosen. The set 0 consists of diagonal (n x n) observation matrices oa,o, one for each pair of observation 0 and action a, where o~'o is the probability of observation 0 when action a is selected and state i is reached. l The state representation in a POMDP (Figure l(a)) is the belief state-the (1 x n) vector of the state-occupation probabilities given the history h. It can be computed recursively given a new action a and observation 0 by b(h)Taoa,o b(hao) = b(h)Taoa,oe;' where en is the (1 x n)-vector of all Is. Finally, a POMDP defines a probability distribution over tests (and thus histories) by P(Ol ... otlhal ... at) == b(h)Ta1oal,Ol ... Taloa£,Ole~. (4) IThere are many equivalent formulations and the conversion procedure described here can be easily modified to accommodate other POMDP definitions. We now present our algorithm for constructing a PSR for a given POMDP. It uses a function u mapping tests to (1 x n) vectors defined recursively by u(c) == en and u(aot) == (Taoa,ou(t)T)T, where c represents the null test. Conceptually, the components of u(t) are the probabilities of the test t when applied from each underlying state of the POMDP; we call u(t) the outcome vector for test t. We say a test t is linearly independent of a set of tests S if its outcome vector is linearly independent of the set of outcome vectors of the tests in S. Our algorithm search is used and defined as Q -<- search(c, {}) search(t, S): for each a E A, 0 E 0 if aot is linearly independent of S then S -<- search(aot, S U {aot}) return S The algorithm maintains a set of tests and searches for new tests that are linearly independent of those already found. It is a form of depth-first search. The algorithm halts when it checks all the one-step extensions of its tests and finds none that are linearly independent. Because the set of tests Q returned by search have linearly independent outcome vectors, the cardinality of Q is bounded by n, ensuring that the algorithm halts after a polynomial number of iterations. Because each test in Q is formed by a one-step extension to some other test in Q, no test is longer than n action-observation pairs. The check for linear independence can be performed in many ways, including Gaussian elimination, implying that search terminates in polynomial time. By construction, all one-step extensions to the set of tests Q returned by search are linearly dependent on those in Q. We now show that this is true for any test. Lemma 1 The outcome vectors of the tests in Q can be linearly combined to produce the outcome vector for any test. Proof: Let U be the (n x q) matrix formed by concatenating the outcome vectors for all tests in Q. 
Since, for all combinations of a and 0, the columns of Taoa,ou are linearly dependent on the columns of U, we can write Taoa,ou == UW T for some q x q matrix of weights W. If t is a test that is linearly dependent on Q, then anyone-step extension of t, aot, is linearly dependent on Q. This is because we can write the outcome vector for t as u(t) == (UwT)T for some (1 x q) weight vector w and the outcome vector for aot as u(aot) == (Taoa,ou(t)T)T == (Taoa,oUwT)T == (UWTwT)T. Thus, aot is linearly dependent on Q. Now, note that all one-step tests are linearly dependent on Q by the structure of the search algorithm. Using the previous paragraph as an inductive argument, this implies that all tests are linearly dependent on Q. 0 Returning to the float/reset example POMDP, search begins with by enumerating the 4 extensions to the null test (fO, fl, rO, and rl). Of these, only fa and rO are are linearly independent. Of the extensions of these, fOrO is the only one that is linearly independent of the other two. The remaining two tests added to Q by search are fOfOrO and fOfOfOrO. No extensions of the 5 tests in Q are linearly independent of the 5 tests in Q, so the procedure halts. We now show that the set of tests Q constitute a PSR for the POMDP by constructing projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. For each combination of a and 0, define a q x q matrix Mao == (U+Taoa,ou)T and a 1 x q vector mao == (U+Taoa,oe;;J T , where U is the matrix of outcome vectors defined in the previous section and U+ is its pseudoinverse2 • The ith row of Mao is maoti. The probability distribution on histories implied by these projection vectors is p(h )m~101 alOl p(h)M~ol M~_10l_1 m~Ol b(h)UU+r a1 oa 1,01 U ... U+T al-10 al-1,Ol-1 UU+Taloal,ol b(h)T a1 0 a1,01 ... ral-l0al-t,ol-lTaloal,Ole~, Le., it is the same as that of the POMDP, as in Equation 4. Here, the last step uses the fact that UU+v T == v T for v T linearly dependent on the columns of U. This holds by construction of U in the previous section. This completes the proof of Theorem 1. Completing the float/reset example, consider the Mf,o matrix found by the process defined in this section. It derives predictions for each test in Q after taking action f. Most of these are quite simple because the tests are so similar: the new prediction for rO is exactly the old prediction for fOrO, for example. The only non trivial test is fOfOfOrO. Its outcome can be computed from 0.250 p(rOlh) - 0.0625 p(fOrOlh) + 0.750 p(fOfOrOlh). This example illustrates that the projection vectors need not contain only positive entries. 3 Conclusion We have introduced a predictive state representation for dynamical systems that is grounded in actions and observations and shown that, even in its linear form, it is at least as general and compact as POMDPs. In essence, we have established PSRs as a non-inferior alternative to POMDPs, and suggested that they might have important advantages, while leaving demonstration of those advantages to future work. We conclude by summarizing the potential advantages (to be explored in future work): Learnability. The k-order Markov model is similar to PSRs in that it is entirely based on actions and observations. Such models can be learned trivially from data by counting-it is an open question whether something similar can be done with a PSR. 
3 Conclusion

We have introduced a predictive state representation for dynamical systems that is grounded in actions and observations and shown that, even in its linear form, it is at least as general and compact as POMDPs. In essence, we have established PSRs as a non-inferior alternative to POMDPs, and suggested that they might have important advantages, while leaving demonstration of those advantages to future work. We conclude by summarizing the potential advantages (to be explored in future work):

Learnability. The k-order Markov model is similar to PSRs in that it is entirely based on actions and observations. Such models can be learned trivially from data by counting; it is an open question whether something similar can be done with a PSR. Jaeger (2000) showed how to learn such a model in the uncontrolled setting, but the situation is more complex in the multiple-action case since outcomes are conditioned on behavior, violating some required independence assumptions.

Compactness. We have shown that there exist linear PSRs no more complex than the minimal POMDP for an environment, but in some cases the minimal linear PSR seems to be much smaller. For example, a POMDP extension of factored MDPs explored by Singh and Cohn (1998) would be cross-products of separate POMDPs and have linear PSRs that increase linearly with the number and size of the component POMDPs, whereas their minimal POMDP representation would grow as the size of the state space, i.e., exponential in the number of component POMDPs. This (apparent) advantage stems from the PSR's combinatorial or factored structure. As a vector of state variables, capable of taking on diverse values, a PSR may be inherently more powerful than the distribution over discrete states (the belief state) of a POMDP. We have already seen that general PSRs can be more compact than POMDPs; they are also capable of efficiently capturing environments in the diversity representation used by Rivest and Schapire (1994), which is known to provide an extremely compact representation for some environments.

Generalization. There are reasons to think that state variables that are themselves predictions may be particularly useful in learning to make other predictions. With so many things to predict, we have in effect a set or sequence of learning problems, all due to the same environment. In many such cases the solutions to earlier problems have been shown to provide features that generalize particularly well to subsequent problems (e.g., Baxter, 2000; Thrun & Pratt, 1998).

Powerful, extensible representations. PSRs that predict tests could be generalized to predict the outcomes of multi-step options (e.g., Sutton et al., 1999). In this case, particularly, they would constitute a powerful language for representing the state of complex environments.

Acknowledgments: We thank Peter Dayan, Lawrence Saul, Fernando Pereira and Rob Schapire for many helpful discussions of these and related ideas.

References
Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164-171.
Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149-198.
Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 183-188). San Jose, California: AAAI Press.
Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12, 1371-1398.
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99-134.
Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47-65.
McCallum, A. K. (1995). Reinforcement learning with selective perception and hidden state. Doctoral dissertation, Department of Computer Science, University of Rochester.
Rivest, R. L., & Schapire, R. E. (1994). Diversity-based inference of finite automata. Journal of the ACM, 41, 555-589.
Shatkay, H., & Kaelbling, L. P. (1997). Learning topological maps with weak local odometric information. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97) (pp. 920-929).
Singh, S., & Cohn, D. (1998). How to dynamically merge Markov decision processes. Advances in Neural Information Processing Systems 10 (pp. 1057-1063).
Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181-211.
Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.
2 0.17870785 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
Author: Evan Greensmith, Peter L. Bartlett, Jonathan Baxter
Abstract: We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary experiments that illustrate the performance improvements on a simple control problem. 1 Introduction, Background, and Preliminary Results In reinforcement learning problems, the aim is to select a controller that will maximize the average reward in some environment. We model the environment as a partially observable Markov decision process (POMDP). Gradient ascent methods (e.g., [7, 12, 15]) estimate the gradient of the average reward, usually using Monte Carlo techniques to cal∗ Most of this work was performed while the authors were with the Research School of Information Sciences and Engineering at the Australian National University. culate an average over a sample path of the controlled POMDP. However such estimates tend to have a high variance, which means many steps are needed to obtain a good estimate. GPOMDP [4] is an algorithm for generating an estimate of the gradient in this way. Compared with other approaches, it is suitable for large systems, when the time between visits to a state is large but the mixing time of the controlled POMDP is short. However, it can suffer from the problem of producing high variance estimates. In this paper, we investigate techniques for variance reduction in GPOMDP. One generic approach to reducing the variance of Monte Carlo estimates of integrals is to use an additive control variate (see, for example, [6]). Suppose we wish to estimate the integral of f : X → R, and we know the integral of another function ϕ : X → R. Since X f = X (f − ϕ) + X ϕ, the integral of f − ϕ can be estimated instead. Obviously if ϕ = f then the variance is zero. More generally, Var(f − ϕ) = Var(f ) − 2Cov(f, ϕ) + Var(ϕ), so that if φ and f are strongly correlated, the variance of the estimate is reduced. In this paper, we consider two approaches of this form. The first (Section 2) is the technique of adding a baseline. We find the optimal baseline and we show that the additional variance of a suboptimal baseline can be expressed as a weighted squared distance from the optimal baseline. Constant baselines, which do not depend on the state or observations, have been widely used [13, 15, 9, 11]. 
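As a quick illustration of the control variate identity recalled above, the following toy Monte Carlo snippet in Python estimates the same integral with and without the correction and prints the sample variances; the integrand, the control variate, and its known mean are arbitrary choices made up for this sketch, before the text returns to the specific baselines discussed next.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy integrand f and a correlated control variate phi with known mean under x ~ Uniform[0, 1].
    f = lambda x: np.exp(x)            # E[f] = e - 1
    phi = lambda x: 1.0 + x            # E[phi] = 1.5, known in closed form
    phi_mean = 1.5

    x = rng.random(100_000)

    plain = f(x)                                  # ordinary Monte Carlo samples
    controlled = f(x) - phi(x) + phi_mean         # control-variate samples; same expectation

    print("plain estimate      %.4f  (sample variance %.4f)" % (plain.mean(), plain.var()))
    print("controlled estimate %.4f  (sample variance %.4f)" % (controlled.mean(), controlled.var()))

Because exp(x) and 1 + x are strongly correlated on [0, 1], the corrected samples have noticeably smaller variance while leaving the estimate unbiased.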
In particular, the expectation over all states of the discounted value of the state is a popular constant baseline (where, for example, the reward at each step is replaced by the difference between the reward and the expected reward). We give bounds on the estimation variance that show that, perhaps surprisingly, this may not be the best choice. The second approach (Section 3) is the use of an approximate value function. Such actorcritic methods have been investigated extensively [3, 1, 14, 10]. Generally the idea is to minimize some notion of distance between the fixed value function and the true value function. In this paper we show that this may not be the best approach: selecting the fixed value function to be equal to the true value function is not always the best choice. Even more surprisingly, we give an example for which the use of a fixed value function that is different from the true value function reduces the variance to zero, for no increase in bias. We give a bound on the expected squared error (that is, including the estimation variance) of the gradient estimate produced with a fixed value function. Our results suggest new algorithms to learn the optimum baseline, and to learn a fixed value function that minimizes the bound on the error of the estimate. In Section 5, we describe the results of preliminary experiments, which show that these algorithms give performance improvements. POMDP with Reactive, Parameterized Policy A partially observable Markov decision process (POMDP) consists of a state space, S, a control space, U, an observation space, Y, a set of transition probability matrices {P(u) : u ∈ U}, each with components pij (u) for i, j ∈ S, u ∈ U, an observation process ν : S → PY , where PY is the space of probability distributions over Y, and a reward function r : S → R. We assume that S, U, Y are finite, although all our results extend easily to infinite U and Y, and with more restrictive assumptions can be extended to infinite S. A reactive, parameterized policy for a POMDP is a set of mappings {µ(·, θ) : Y → PU |θ ∈ RK }. Together with the POMDP, this defines the controlled POMDP (S, U, Y, P , ν, r, µ). The joint state, observation and control process, {Xt , Yt , Ut }, is Markov. The state process, {Xt }, is also Markov, with transition probabilities pij (θ) = y∈Y,u∈U νy (i)µu (y, θ)pij (u), where νy (i) denotes the probability of observation y given the state i, and µu (y, θ) denotes the probability of action u given parameters θ and observation y. The Markov chain M(θ) = (S, P(θ)) then describes the behaviour of the process {Xt }. Assumption 1 The controlled POMDP (S, U, Y, P , ν, r, µ) satisfies: For all θ ∈ RK there exists a unique stationary distribution satisfying π (θ) P(θ) = π (θ). There is an R < ∞ such that, for all i ∈ S, |r(i)| ≤ R. There is a B < ∞ such that, for all u ∈ U, y ∈ Y and θ ∈ RK the derivatives ∂µu (y, θ)/∂θk (1 ≤ k ≤ K) exist, and the vector of these derivatives satisfies µu (y, θ)/µu (y, θ) ≤ B, where · denotes the Euclidean norm on RK . def T −1 1 We consider the average reward, η(θ) = limT →∞ E T t=0 r(Xt ) . Assumption 1 implies that this limit exists, and does not depend on the start state X0 . The aim is to def select a policy to maximize this quantity. Define the discounted value function, J β (i, θ) = T −1 t limT →∞ E t=0 β r(Xt ) X0 = i . Throughout the rest of the paper, dependences upon θ are assumed, and dropped in the notation. 
For a random vector A, we denote Var(A) = E (A − E [A])2 , where a2 denotes a a, and a denotes the transpose of the column vector a. GPOMDP Algorithm The GPOMDP algorithm [4] uses a sample path to estimate the gradient approximation def µu(y) η, but the βη = E β η approaches the true gradient µu(y) Jβ (j) . As β → 1, def variance increases. We consider a slight modification [2]: with Jt = def ∆T = 1 T T −1 t=0 2T s=t µUt (Yt ) Jt+1 . µUt (Yt ) β s−t r(Xs ), (1) Throughout this paper the process {Xt , Yt , Ut , Xt+1 } is generally understood to be generated by a controlled POMDP satisfying Assumption 1, with X0 ∼π (ie the initial state distributed according to the stationary distribution). That is, before computing the gradient estimates, we wait until the process has settled down to the stationary distribution. Dependent Samples Correlation terms arise in the variance quantities to be analysed. We show here that considering iid samples gives an upper bound on the variance of the general case. The mixing time of a finite ergodic Markov chain M = (S, P ) is defined as def τ = min t > 1 : max dT V i,j Pt i , Pt j ≤ e−1 , where [P t ]i denotes the ith row of P t and dT V is the total variation distance, dT V (P, Q) = i |P (i) − Q(i)|. Theorem 1 Let M = (S, P ) be a finite ergodic Markov chain, with mixing time τ , and 2|S|e and 0 ≤ α < let π be its stationary distribution. There are constants L < exp(−1/(2τ )), which depend only on M , such that, for all f : S → R and all t, Covπ (t) ≤ Lαt Varπ (f), where Varπ (f) is the variance of f under π, and Covπ (t) is f f the auto-covariance of the process {f(Xt )}, where the process {Xt } is generated by M with initial distribution π. Hence, for some constant Ω∗ ≤ 4Lτ , Var 1 T T −1 f(Xt ) t=0 ≤ Ω∗ Varπ (f). T We call (L, τ ) the mixing constants of the Markov chain M (or of the controlled POMDP D; ie the Markov chain (S, P )). We omit the proof (all proofs are in the full version [8]). Briefly, we show that for a finite ergodic Markov chain M , Covπ (t) ≤ Rt (M )Varπ (f), f 2 t for some Rt (M ). We then show that Rt (M ) < 2|S| exp(− τ ). In fact, for a reversible chain, we can choose L = 1 and α = |λ2 |, the second largest magnitude eigenvalue of P . 2 Baseline We consider an alteration of (1), def ∆T = 1 T T −1 µUt (Yt ) (Jt+1 − Ar (Yt )) . µUt (Yt ) t=0 (2) For any baseline Ar : Y → R, it is easy to show that E [∆T ] = E [∆T ]. Thus, we select Ar to minimize variance. The following theorem shows that this variance is bounded by a variance involving iid samples, with Jt replaced by the exact value function. Theorem 2 Suppose that D = (S, U, Y, P , ν, r, µ) is a controlled POMDP satisfying Assumption 1, D has mixing constants (L, τ ), {Xt , Yt , Ut , Xt+1 } is a process generated by D with X0 ∼π ,Ar : Y → R is a baseline that is uniformly bounded by M, and J (j) has the distribution of ∞ β s r(Xt ), where the states Xt are generated by D starting in s=0 X0 = j. Then there are constants C ≤ 5B2 R(R + M) and Ω ≤ 4Lτ ln(eT ) such that Var 1 T T −1 t=0 µUt (Yt ) Ω (Jt+1 −Ar (Yt )) ≤ Varπ µUt (Yt ) T + Ω E T µu (y) (J (j) − Jβ (j)) µu (y) µu (y) (Jβ (j)−Ar (y)) µu (y) 2 + Ω +1 T C βT , (1 − β)2 where, as always, (i, y, u, j) are generated iid with i∼π, y∼ν(i), u∼µ(y) and j∼P i (u). The proof uses Theorem 1 and [2, Lemma 4.3]. Here we have bounded the variance of (2) with the variance of a quantity we may readily analyse. 
The second term on the right hand side shows the error associated with replacing an unbiased, uncorrelated estimate of the value function with the true value function. This quantity is not dependent on the baseline. The final term on the right hand side arises from the truncation of the discounted reward— and is exponentially decreasing. We now concentrate on minimizing the variance σ 2 = Varπ r µu (y) (Jβ (j) − Ar (y)) , µu (y) (3) which the following lemma relates to the variance σ 2 without a baseline, µu (y) Jβ (j) . µu (y) σ 2 = Varπ Lemma 3 Let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1. For any baseline Ar : Y → R, and for i∼π, y∼ν(i), u∼µ(y) and j∼Pi (u), σ 2 = σ 2 + E A2 (y) E r r µu (y) µu (y) 2 y − 2Ar (y)E µu (y) µu (y) 2 Jβ (j) y . From Lemma 3 it can be seen that the task of finding the optimal baseline is in effect that of minimizing a quadratic for each observation y ∈ Y. This gives the following theorem. Theorem 4 For the controlled POMDP as in Lemma 3, 2 µu (y) min σ 2 = σ 2 − E E Jβ (j) y r Ar µu (y) 2 /E µu (y) µu (y) 2 y and this minimum is attained with the baseline 2 µu (y) µu (y) A∗ (y) = E r Jβ (j) , 2 µu (y) µu (y) /E y y . Furthermore, the optimal constant baseline is µu (y) µu (y) A∗ = E r 2 Jβ (j) /E µu (y) µu (y) 2 . The following theorem shows that the variance of an estimate with an arbitrary baseline can be expressed as the sum of the variance with the optimal baseline and a certain squared weighted distance between the baseline function and the optimal baseline function. Theorem 5 If Ar : Y → R is a baseline function, A∗ is the optimal baseline defined in r Theorem 4, and σ 2 is the variance of the corresponding estimate, then r∗ µu (y) µu (y) σ 2 = σ2 + E r r∗ 2 (Ar (y) − A∗ (y)) r 2 , where i∼π, y ∼ν(i), and u∼µ(y). Furthermore, the same result is true for the case of constant baselines, with Ar (y) replaced by an arbitrary constant baseline Ar , and A∗ (y) r replaced by A∗ , the optimum constant baseline defined in Theorem 4. r For the constant baseline Ar = E i∼π [Jβ (i)], Theorem 5 implies that σ 2 is equal to r min Ar ∈R σ2 r + E µu (y) µu (y) 2 E [Jβ (j)] − E µu (y) µu (y) 2 2 /E Jβ (j) µu (y) µu (y) 2 . Thus, its performance depends on the random variables ( µu (y)/µu (y))2 and Jβ (j); if they are nearly independent, E [Jβ ] is a good choice. 3 Fixed Value Functions: Actor-Critic Methods We consider an alteration of (1), ˜ def 1 ∆T = T T −1 t=0 µUt (Yt ) ˜ Jβ (Xt+1 ), µUt (Yt ) (4) ˜ for some fixed value function Jβ : S → R. Define ∞ def β k d(Xk , Xk+1 ) Aβ (j) = E X0 = j , k=0 def ˜ ˜ where d(i, j) = r(i) + β Jβ (j) − Jβ (i) is the temporal difference. Then it is easy to show that the estimate (4) has a bias of µu (y) ˜ Aβ (j) . β η − E ∆T = E µu (y) The following theorem gives a bound on the expected squared error of (4). The main tool in the proof is Theorem 1. Theorem 6 Let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1. For a sample path from D, that is, {X0∼π, Yt∼ν(Xt ), Ut∼µ(Yt ), Xt+1∼PXt (Ut )}, E ˜ ∆T − βη 2 ≤ Ω∗ Varπ T µu (y) ˜ Jβ (j) + E µu (y) µu (y) Aβ (j) µu (y) 2 , where the second expectation is over i∼π, y∼ν(i), u∼µ(y), and j∼P i (u). ˜ If we write Jβ (j) = Jβ (j) + v(j), then by selecting v = (v(1), . . . , v(|S|)) from the right def null space of the K × |S| matrix G, where G = i,y,u πi νy (i) µu (y)Pi (u), (4) will produce an unbiased estimate of β η. An obvious example of such a v is a constant vector, (c, c, . . . , c) : c ∈ R. 
We can use this to construct a trivial example where (4) produces an unbiased estimate with zero variance. Indeed, let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1, with r(i) = c, for some 0 < c ≤ R. Then Jβ (j) = c/(1 − β) and β η = 0. If we choose v = (−c/(1 − β), . . . , −c/(1 − β)) and ˜ ˜ Jβ (j) = Jβ (j) + v(j), then µµu(y) Jβ (j) = 0 for all y, u, j, and so (4) gives an unbiased u(y) estimate of β η, with zero variance. Furthermore, for any D for which there exists a pair y, u such that µu (y) > 0 and µu (y) = 0, choosing ˜β (j) = Jβ (j) gives a variance J greater than zero—there is a non-zero probability event, (Xt = i, Yt = y, Ut = u, Xt+1 = j), such that µµu(y) Jβ (j) = 0. u(y) 4 Algorithms Given a parameterized class of baseline functions Ar (·, θ) : Y → R θ ∈ RL , we can use Theorem 5 to bound the variance of our estimates. Computing the gradient of this bound with respect to the parameters θ of the baseline function allows a gradient optimization of the baseline. The GDORB Algorithm produces an estimate ∆ S of these gradients from a sample path of length S. Under the assumption that the baseline function and its gradients are uniformly bounded, we can show that these estimates converge to the gradient of σ 2 . We omit the details (see [8]). r GDORB Algorithm: Given: Controlled POMDP (S, U, Y, P , ν, r, µ), parameterized baseline Ar . set z0 = 0 (z0 ∈ RL ), ∆0 = 0 (∆0 ∈ RL ) for all {is , ys , us , is+1 , ys+1 } generated by the POMDP do zs+1 = βzs + ∆s+1 = ∆s + end for Ar (ys ) 1 s+1 µus(ys ) µus(ys ) 2 ((Ar (ys ) − βAr (ys+1 ) − r(xs+1 )) zs+1 − ∆s ) ˜ For a parameterized class of fixed value functions {Jβ (·, θ) : S → R θ ∈ RL }, we can use Theorem 6 to bound the expected squared error of our estimates, and compute the gradient of this bound with respect to the parameters θ of the baseline function. The GBTE Algorithm produces an estimate ∆S of these gradients from a sample path of length S. Under the assumption that the value function and its gradients are uniformly bounded, we can show that these estimates converge to the true gradient. GBTE Algorithm: Given: Controlled POMDP (S, U, Y, P , ν, r, µ), parameterized fixed value function ˜β . J set z0 = 0 (z0 ∈ RK ), ∆A0 = 0 (∆A0 ∈ R1×L ), ∆B 0 = 0 (∆B 0 ∈ RK ), ∆C 0 = 0 (∆C 0 ∈ RK ) and ∆D0 = 0 (∆D0 ∈ RK×L ) for all {is , ys , us , is+1 , is+2 } generated by the POMDP do µ s(y ) zs+1 = βzs + µuu(yss ) s µus(ys ) ˜ µus(ys ) Jβ (is+1 ) µus(ys ) µus(ys ) ˜ Jβ (is+1 ) ∆As+1 = ∆As + 1 s+1 ∆B s+1 = ∆B s + 1 s+1 µus(ys ) ˜ µus(ys ) Jβ (is+1 ) ∆C s+1 = ∆C s + 1 s+1 ˜ ˜ r(is+1 ) + β Jβ (is+2 ) − Jβ (is+1 ) zs+1 − ∆C s ∆Ds+1 = ∆Ds + 1 s+1 µus(ys ) µus(ys ) ∆s+1 = end for Ω∗ T ∆As+1 − − ∆B s ˜ Jβ (is+1 ) Ω∗ T ∆B s+1 ∆D s+1 − ∆As − ∆D s − ∆C s+1 ∆Ds+1 5 Experiments Experimental results comparing these GPOMDP variants for a simple three state MDP (described in [5]) are shown in Figure 1. The exact value function plots show how different choices of baseline and fixed value function compare when all algorithms have access to the exact value function Jβ . Using the expected value function as a baseline was an improvement over GPOMDP. Using the optimum, or optimum constant, baseline was a further improvement, each performing comparably to the other. Using the pre-trained fixed value function was also an improvement over GPOMDP, showing that selecting the true value function was indeed not the best choice in this case. 
The trained fixed value function was not optimal though, as Jβ(j) − A∗_r is a valid choice of fixed value function. The optimum baseline, and fixed value function, will not normally be known. The online plots show experiments where the baseline and fixed value function were trained using online gradient descent whilst the performance gradient was being estimated, using the same data. Clear improvement over GPOMDP is seen for the online trained baseline variant. For the online trained fixed value function, improvement is seen until T becomes very large, given the simplicity of the system.

References
[1] L. Baird and A. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11, pages 968–974. MIT Press, 1999.
[2] P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and System Sciences, 2002. To appear.
[3] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834–846, 1983.
[4] J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[5] J. Baxter, P. L. Bartlett, and L. Weaver. Infinite-horizon gradient-based policy search: II. Gradient ascent algorithms and experiments. Journal of Artificial Intelligence Research, 15:351–381, 2001.
[6] M. Evans and T. Swartz. Approximating integrals via Monte Carlo and deterministic methods. Oxford University Press, 2000.

[Figure 1 plots: four panels, "Exact Value Function (Mean Error)", "Exact Value Function (One Standard Deviation)", "Online (Mean Error)", and "Online (One Standard Deviation)", each showing relative norm difference against T on a log scale from 1 to 1e+07; curves GPOMDP-Jβ, BL-E[Jβ], BL-A∗_r(y), BL-A∗_r, and FVF-pretrain in the exact value function panels, and GPOMDP, BL-online, and FVF-online in the online panels.]

Figure 1: Three state experiments; relative norm error ||∆_est − ∇η|| / ||∇η||. Exact value function plots compare mean error and standard deviations for gradient estimates (with knowledge of Jβ) computed by: GPOMDP [GPOMDP-Jβ]; with baseline Ar = E[Jβ] [BL-E[Jβ]]; with optimum baseline [BL-A∗_r(y)]; with optimum constant baseline [BL-A∗_r]; with pre-trained fixed value function [FVF-pretrain]. Online plots do a similar comparison of estimates computed by: GPOMDP [GPOMDP]; with online trained baseline [BL-online]; with online trained fixed value function [FVF-online]. The plots were computed over 500 runs (1000 for FVF-online), with β = 0.95. Ω∗/T was set to 0.001 for FVF-pretrain, and 0.01 for FVF-online.

[7] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33:75–84, 1990.
[8] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Technical report, ANU, 2002.
[9] H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In D. H. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pages 152–160, 1997.
[10] V. R. Konda and J.
N. Tsitsiklis. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems 12, pages 1008–1014. MIT Press, 2000. [11] P. Marbach and J. N. Tsitsiklis. Simulation-Based Optimization of Markov Reward Processes. Technical report, MIT, 1998. [12] R. Y. Rubinstein. How to optimize complex stochastic systems from a single sample path by the score function method. Ann. Oper. Res., 27:175–211, 1991. [13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge MA, 1998. ISBN 0-262-19398-1. [14] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000. [15] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.
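To connect the baseline idea in the listing above to something runnable, here is a small, self-contained Python experiment on a two-armed bandit with a softmax policy. It is not the paper's GPOMDP setting, and the parameters, rewards, and sample sizes are invented for illustration; it compares the empirical variance of score-function gradient estimates with a zero baseline and with the mean reward as a constant baseline.

    import numpy as np

    rng = np.random.default_rng(1)

    # Two-armed bandit with a softmax policy; score-function (likelihood-ratio) gradient estimates.
    theta = np.array([0.2, -0.1])          # policy parameters (one logit per arm), chosen arbitrarily
    rewards = np.array([1.0, 0.0])         # deterministic reward of each arm

    def grad_estimates(baseline, n=200_000):
        probs = np.exp(theta) / np.exp(theta).sum()
        arms = rng.choice(2, size=n, p=probs)
        r = rewards[arms]
        # grad of log pi(a) with respect to theta[0]: (1 - p0) if a == 0, else -p0
        score0 = np.where(arms == 0, 1.0 - probs[0], -probs[0])
        return score0 * (r - baseline)     # per-sample estimates of d(expected reward)/d theta[0]

    for b in (0.0, rewards.mean()):
        g = grad_estimates(b)
        print("baseline %.2f: mean gradient %.4f, variance %.4f" % (b, g.mean(), g.var()))

Both settings estimate the same gradient, since subtracting a constant baseline leaves the expectation unchanged, but the printed variances will typically differ by a large factor.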
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function. Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
4 0.10077156 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
Author: Eyal Even-dar, Yishay Mansour
Abstract: We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of ε-greedy Q-learning.
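The abstract above describes optimistic Q-learning precisely enough to sketch: initialize every Q-value to a large constant and then act greedily while applying the standard Q-learning update. The snippet below is such a sketch in Python; the env.reset()/env.step() interface, the initialization constant, and the learning-rate and discount settings are assumptions, not taken from the paper.

    import numpy as np

    def optimistic_q_learning(env, n_states, n_actions, episodes=500,
                              q_init=100.0, gamma=0.95, alpha=0.1):
        # Greedy Q-learning with optimistic initialization (no explicit exploration noise).
        Q = np.full((n_states, n_actions), q_init)   # large initial values drive exploration
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = int(np.argmax(Q[s]))             # purely greedy w.r.t. the optimistic values
                s2, r, done = env.step(a)
                target = r + (0.0 if done else gamma * Q[s2].max())
                Q[s, a] += alpha * (target - Q[s, a])
                s = s2
        return Q

No ε-exploration is needed here because untried actions retain their optimistic values and therefore keep being selected until their estimates fall.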
5 0.093113489 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily adjust the complexity of the function approximator to fit the complexity of the value function.
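The abstract above concerns kernel-based value function approximation. As a loose illustration only, using SVM regression on sampled Bellman backups rather than the paper's linear-programming formulations, one might fit a value function with scikit-learn's SVR as follows; the synthetic data, kernel, and constants are all placeholders.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(2)

    # Placeholder data: states, rewards, and successor states sampled under some fixed policy.
    states = rng.uniform(-1, 1, size=(500, 2))
    rewards = -np.linalg.norm(states, axis=1)
    next_states = np.clip(states + rng.normal(scale=0.1, size=states.shape), -1, 1)

    gamma = 0.9
    v = np.zeros(len(states))                  # current value estimates at the successor states

    # A few rounds of fitted value iteration with an epsilon-insensitive kernel regressor.
    for _ in range(5):
        targets = rewards + gamma * v          # one-step Bellman backups at the sampled states
        model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(states, targets)
        v = model.predict(next_states)

    print("fitted V at the origin:", model.predict([[0.0, 0.0]])[0])

The ε-insensitive loss tends to keep only a subset of the samples as support vectors, which is loosely in the spirit of the abstract's goal of minimizing the number of support vectors while fitting the data.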
6 0.091765001 128 nips-2001-Multiagent Planning with Factored MDPs
7 0.091719881 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
8 0.086144567 59 nips-2001-Direct value-approximation for factored MDPs
9 0.085780844 126 nips-2001-Motivated Reinforcement Learning
10 0.079825744 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
11 0.066304594 183 nips-2001-The Infinite Hidden Markov Model
12 0.064639471 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
13 0.063634612 115 nips-2001-Linear-time inference in Hierarchical HMMs
14 0.059617527 121 nips-2001-Model-Free Least-Squares Policy Iteration
15 0.057191942 13 nips-2001-A Natural Policy Gradient
16 0.05657129 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
17 0.053937528 43 nips-2001-Bayesian time series classification
18 0.05306289 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
19 0.05304908 36 nips-2001-Approximate Dynamic Programming via Linear Programming
20 0.052877475 139 nips-2001-Online Learning with Kernels
topicId topicWeight
[(0, -0.16), (1, -0.072), (2, 0.172), (3, -0.022), (4, -0.033), (5, 0.009), (6, 0.029), (7, 0.017), (8, -0.034), (9, -0.012), (10, -0.027), (11, 0.013), (12, -0.014), (13, 0.046), (14, 0.009), (15, -0.04), (16, 0.088), (17, 0.041), (18, 0.11), (19, -0.006), (20, 0.049), (21, 0.001), (22, 0.025), (23, 0.06), (24, -0.014), (25, -0.081), (26, 0.057), (27, 0.044), (28, -0.02), (29, -0.083), (30, 0.042), (31, 0.185), (32, -0.076), (33, -0.078), (34, -0.064), (35, -0.116), (36, 0.178), (37, 0.043), (38, 0.128), (39, -0.048), (40, -0.052), (41, -0.112), (42, 0.015), (43, -0.084), (44, 0.029), (45, 0.018), (46, -0.041), (47, -0.063), (48, 0.174), (49, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.93279809 148 nips-2001-Predictive Representations of State
2 0.70044076 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
Author: Evan Greensmith, Peter L. Bartlett, Jonathan Baxter
Abstract: We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary experiments that illustrate the performance improvements on a simple control problem. 1 Introduction, Background, and Preliminary Results In reinforcement learning problems, the aim is to select a controller that will maximize the average reward in some environment. We model the environment as a partially observable Markov decision process (POMDP). Gradient ascent methods (e.g., [7, 12, 15]) estimate the gradient of the average reward, usually using Monte Carlo techniques to cal∗ Most of this work was performed while the authors were with the Research School of Information Sciences and Engineering at the Australian National University. culate an average over a sample path of the controlled POMDP. However such estimates tend to have a high variance, which means many steps are needed to obtain a good estimate. GPOMDP [4] is an algorithm for generating an estimate of the gradient in this way. Compared with other approaches, it is suitable for large systems, when the time between visits to a state is large but the mixing time of the controlled POMDP is short. However, it can suffer from the problem of producing high variance estimates. In this paper, we investigate techniques for variance reduction in GPOMDP. One generic approach to reducing the variance of Monte Carlo estimates of integrals is to use an additive control variate (see, for example, [6]). Suppose we wish to estimate the integral of f : X → R, and we know the integral of another function ϕ : X → R. Since X f = X (f − ϕ) + X ϕ, the integral of f − ϕ can be estimated instead. Obviously if ϕ = f then the variance is zero. More generally, Var(f − ϕ) = Var(f ) − 2Cov(f, ϕ) + Var(ϕ), so that if φ and f are strongly correlated, the variance of the estimate is reduced. In this paper, we consider two approaches of this form. The first (Section 2) is the technique of adding a baseline. We find the optimal baseline and we show that the additional variance of a suboptimal baseline can be expressed as a weighted squared distance from the optimal baseline. Constant baselines, which do not depend on the state or observations, have been widely used [13, 15, 9, 11]. 
In particular, the expectation over all states of the discounted value of the state is a popular constant baseline (where, for example, the reward at each step is replaced by the difference between the reward and the expected reward). We give bounds on the estimation variance that show that, perhaps surprisingly, this may not be the best choice. The second approach (Section 3) is the use of an approximate value function. Such actorcritic methods have been investigated extensively [3, 1, 14, 10]. Generally the idea is to minimize some notion of distance between the fixed value function and the true value function. In this paper we show that this may not be the best approach: selecting the fixed value function to be equal to the true value function is not always the best choice. Even more surprisingly, we give an example for which the use of a fixed value function that is different from the true value function reduces the variance to zero, for no increase in bias. We give a bound on the expected squared error (that is, including the estimation variance) of the gradient estimate produced with a fixed value function. Our results suggest new algorithms to learn the optimum baseline, and to learn a fixed value function that minimizes the bound on the error of the estimate. In Section 5, we describe the results of preliminary experiments, which show that these algorithms give performance improvements. POMDP with Reactive, Parameterized Policy A partially observable Markov decision process (POMDP) consists of a state space, S, a control space, U, an observation space, Y, a set of transition probability matrices {P(u) : u ∈ U}, each with components pij (u) for i, j ∈ S, u ∈ U, an observation process ν : S → PY , where PY is the space of probability distributions over Y, and a reward function r : S → R. We assume that S, U, Y are finite, although all our results extend easily to infinite U and Y, and with more restrictive assumptions can be extended to infinite S. A reactive, parameterized policy for a POMDP is a set of mappings {µ(·, θ) : Y → PU |θ ∈ RK }. Together with the POMDP, this defines the controlled POMDP (S, U, Y, P , ν, r, µ). The joint state, observation and control process, {Xt , Yt , Ut }, is Markov. The state process, {Xt }, is also Markov, with transition probabilities pij (θ) = y∈Y,u∈U νy (i)µu (y, θ)pij (u), where νy (i) denotes the probability of observation y given the state i, and µu (y, θ) denotes the probability of action u given parameters θ and observation y. The Markov chain M(θ) = (S, P(θ)) then describes the behaviour of the process {Xt }. Assumption 1 The controlled POMDP (S, U, Y, P , ν, r, µ) satisfies: For all θ ∈ RK there exists a unique stationary distribution satisfying π (θ) P(θ) = π (θ). There is an R < ∞ such that, for all i ∈ S, |r(i)| ≤ R. There is a B < ∞ such that, for all u ∈ U, y ∈ Y and θ ∈ RK the derivatives ∂µu (y, θ)/∂θk (1 ≤ k ≤ K) exist, and the vector of these derivatives satisfies µu (y, θ)/µu (y, θ) ≤ B, where · denotes the Euclidean norm on RK . def T −1 1 We consider the average reward, η(θ) = limT →∞ E T t=0 r(Xt ) . Assumption 1 implies that this limit exists, and does not depend on the start state X0 . The aim is to def select a policy to maximize this quantity. Define the discounted value function, J β (i, θ) = T −1 t limT →∞ E t=0 β r(Xt ) X0 = i . Throughout the rest of the paper, dependences upon θ are assumed, and dropped in the notation. 
For a random vector A, we denote Var(A) = E[(A − E[A])²], where a² denotes a′a and a′ denotes the transpose of the column vector a.

GPOMDP Algorithm

The GPOMDP algorithm [4] uses a sample path to estimate the gradient approximation ∇_β η = E[(∇µ_u(y)/µ_u(y)) J_β(j)]. As β → 1, ∇_β η approaches the true gradient ∇η, but the variance increases. We consider a slight modification [2]:

∆_T = (1/T) Σ_{t=0}^{T−1} (∇µ_{U_t}(Y_t)/µ_{U_t}(Y_t)) J_{t+1},  where  J_t = Σ_{s=t}^{2T} β^{s−t} r(X_s).   (1)

Throughout this paper the process {X_t, Y_t, U_t, X_{t+1}} is generally understood to be generated by a controlled POMDP satisfying Assumption 1, with X_0 ∼ π (i.e., the initial state distributed according to the stationary distribution). That is, before computing the gradient estimates, we wait until the process has settled down to the stationary distribution.

Dependent Samples

Correlation terms arise in the variance quantities to be analysed. We show here that considering iid samples gives an upper bound on the variance of the general case. The mixing time of a finite ergodic Markov chain M = (S, P) is defined as τ = min{t > 1 : max_{i,j} d_TV([P^t]_i, [P^t]_j) ≤ e^{−1}}, where [P^t]_i denotes the ith row of P^t and d_TV is the total variation distance, d_TV(P, Q) = Σ_i |P(i) − Q(i)|.

Theorem 1. Let M = (S, P) be a finite ergodic Markov chain, with mixing time τ, and let π be its stationary distribution. There are constants L < 2|S|e and 0 ≤ α < exp(−1/(2τ)), which depend only on M, such that, for all f : S → R and all t, Cov_π^f(t) ≤ L α^t Var_π(f), where Var_π(f) is the variance of f under π, and Cov_π^f(t) is the auto-covariance of the process {f(X_t)}, where the process {X_t} is generated by M with initial distribution π. Hence, for some constant Ω* ≤ 4Lτ,

Var((1/T) Σ_{t=0}^{T−1} f(X_t)) ≤ (Ω*/T) Var_π(f).

We call (L, τ) the mixing constants of the Markov chain M (or of the controlled POMDP D; i.e., of the Markov chain M(θ) = (S, P(θ))). We omit the proof (all proofs are in the full version [8]). Briefly, we show that for a finite ergodic Markov chain M, Cov_π^f(t) ≤ R_t(M) Var_π(f) for some R_t(M). We then show that R_t(M) < 2|S| exp(−t/(2τ)). In fact, for a reversible chain, we can choose L = 1 and α = |λ₂|, the second largest magnitude eigenvalue of P.

2 Baseline

We consider an alteration of (1),

∆_T = (1/T) Σ_{t=0}^{T−1} (∇µ_{U_t}(Y_t)/µ_{U_t}(Y_t)) (J_{t+1} − A_r(Y_t)).   (2)

For any baseline A_r : Y → R, it is easy to show that the expectation of (2) is equal to that of (1). Thus, we select A_r to minimize variance. The following theorem shows that this variance is bounded by a variance involving iid samples, with J_t replaced by the exact value function.

Theorem 2. Suppose that D = (S, U, Y, P, ν, r, µ) is a controlled POMDP satisfying Assumption 1, D has mixing constants (L, τ), {X_t, Y_t, U_t, X_{t+1}} is a process generated by D with X_0 ∼ π, A_r : Y → R is a baseline that is uniformly bounded by M, and J(j) has the distribution of Σ_{s=0}^∞ β^s r(X_s), where the states X_s are generated by D starting in X_0 = j. Then there are constants C ≤ 5B²R(R + M) and Ω ≤ 4Lτ ln(eT) such that

Var((1/T) Σ_{t=0}^{T−1} (∇µ_{U_t}(Y_t)/µ_{U_t}(Y_t)) (J_{t+1} − A_r(Y_t)))
  ≤ (Ω/T) Var_π((∇µ_u(y)/µ_u(y)) (J_β(j) − A_r(y)))
  + (Ω/T) E[((∇µ_u(y)/µ_u(y)) (J(j) − J_β(j)))²]
  + (Ω/T + 1) C β^T/(1 − β)²,

where, as always, (i, y, u, j) are generated iid with i ∼ π, y ∼ ν(i), u ∼ µ(y) and j ∼ P_i(u). The proof uses Theorem 1 and [2, Lemma 4.3]. Here we have bounded the variance of (2) with the variance of a quantity we may readily analyse.
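Before examining the terms of this bound, here is a sample-path sketch of the estimator (2) itself (and of (1), when the baseline is identically zero). It reuses the tabular softmax policy of the earlier sketch; truncating the discounted sums at the end of the recorded path is a simplification of the 2T truncation used above.

```python
import numpy as np

def grad_log_mu(y, u, theta):
    """Score function: the gradient of log mu_u(y, theta) for the softmax table."""
    g = np.zeros_like(theta)
    p = softmax(theta[y])
    g[y] = -p
    g[y, u] += 1.0
    return g

def gpomdp_with_baseline(states, obs, acts, r, theta, beta, A_r, T):
    """Estimate (1/T) sum_{t<T} (grad mu / mu) (J_{t+1} - A_r(Y_t)), as in (2).

    states, obs, acts hold X_t, Y_t, U_t for t = 0, ..., 2T (collected after
    the chain has approximately reached stationarity); r is the reward vector
    over states; A_r maps an observation to a scalar baseline value.
    """
    rewards = r[np.asarray(states)]
    # Discounted sums J_t = sum_{s>=t} beta^(s-t) r(X_s), truncated at the
    # end of the recorded path, computed by a backward recursion.
    J = np.zeros(len(rewards) + 1)
    for t in range(len(rewards) - 1, -1, -1):
        J[t] = rewards[t] + beta * J[t + 1]
    delta = np.zeros_like(theta)
    for t in range(T):
        delta += grad_log_mu(obs[t], acts[t], theta) * (J[t + 1] - A_r(obs[t]))
    return delta / T
```

With a path generated by `step(...)`, passing `A_r = lambda y: 0.0` recovers the plain GPOMDP estimate of (1).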
The second term on the right-hand side of the bound in Theorem 2 shows the error associated with replacing the true value function by an unbiased, uncorrelated estimate of it. This quantity does not depend on the baseline. The final term on the right-hand side arises from the truncation of the discounted reward, and is exponentially decreasing. We now concentrate on minimizing the variance

σ_r² = Var_π((∇µ_u(y)/µ_u(y)) (J_β(j) − A_r(y))),   (3)

which the following lemma relates to the variance without a baseline, σ² = Var_π((∇µ_u(y)/µ_u(y)) J_β(j)).

Lemma 3. Let D = (S, U, Y, P, ν, r, µ) be a controlled POMDP satisfying Assumption 1. For any baseline A_r : Y → R, and for i ∼ π, y ∼ ν(i), u ∼ µ(y) and j ∼ P_i(u),

σ_r² = σ² + E[ A_r²(y) E[‖∇µ_u(y)/µ_u(y)‖² | y] − 2 A_r(y) E[‖∇µ_u(y)/µ_u(y)‖² J_β(j) | y] ].

From Lemma 3 it can be seen that the task of finding the optimal baseline is in effect that of minimizing a quadratic for each observation y ∈ Y. This gives the following theorem.

Theorem 4. For the controlled POMDP as in Lemma 3,

min_{A_r} σ_r² = σ² − E[ (E[‖∇µ_u(y)/µ_u(y)‖² J_β(j) | y])² / E[‖∇µ_u(y)/µ_u(y)‖² | y] ],

and this minimum is attained with the baseline

A_r*(y) = E[‖∇µ_u(y)/µ_u(y)‖² J_β(j) | y] / E[‖∇µ_u(y)/µ_u(y)‖² | y].

Furthermore, the optimal constant baseline is

A_r* = E[‖∇µ_u(y)/µ_u(y)‖² J_β(j)] / E[‖∇µ_u(y)/µ_u(y)‖²].

The following theorem shows that the variance of an estimate with an arbitrary baseline can be expressed as the sum of the variance with the optimal baseline and a certain squared weighted distance between the baseline function and the optimal baseline function.

Theorem 5. If A_r : Y → R is a baseline function, A_r* is the optimal baseline defined in Theorem 4, and σ_{r*}² is the variance of the corresponding estimate, then

σ_r² = σ_{r*}² + E[ ‖∇µ_u(y)/µ_u(y)‖² (A_r(y) − A_r*(y))² ],

where i ∼ π, y ∼ ν(i), and u ∼ µ(y). Furthermore, the same result is true for the case of constant baselines, with A_r(y) replaced by an arbitrary constant baseline A_r, and A_r*(y) replaced by A_r*, the optimum constant baseline defined in Theorem 4.

For the constant baseline A_r = E_{i∼π}[J_β(i)], Theorem 5 implies that σ_r² is equal to

min_{A_r ∈ R} σ_r² + ( E[‖∇µ_u(y)/µ_u(y)‖²] E[J_β(j)] − E[‖∇µ_u(y)/µ_u(y)‖² J_β(j)] )² / E[‖∇µ_u(y)/µ_u(y)‖²].

Thus, its performance depends on the random variables ‖∇µ_u(y)/µ_u(y)‖² and J_β(j); if they are nearly independent, E[J_β] is a good choice.

3 Fixed Value Functions: Actor-Critic Methods

We consider an alteration of (1),

∆̃_T = (1/T) Σ_{t=0}^{T−1} (∇µ_{U_t}(Y_t)/µ_{U_t}(Y_t)) J̃_β(X_{t+1}),   (4)

for some fixed value function J̃_β : S → R. Define

A_β(j) = E[ Σ_{k=0}^∞ β^k d(X_k, X_{k+1}) | X_0 = j ],

where d(i, j) = r(i) + β J̃_β(j) − J̃_β(i) is the temporal difference. Then it is easy to show that the estimate (4) has a bias of

∇_β η − E[∆̃_T] = E[ (∇µ_u(y)/µ_u(y)) A_β(j) ].

The following theorem gives a bound on the expected squared error of (4). The main tool in the proof is Theorem 1.

Theorem 6. Let D = (S, U, Y, P, ν, r, µ) be a controlled POMDP satisfying Assumption 1. For a sample path from D, that is, {X_0 ∼ π, Y_t ∼ ν(X_t), U_t ∼ µ(Y_t), X_{t+1} ∼ P_{X_t}(U_t)},

E[ ‖∆̃_T − ∇_β η‖² ] ≤ (Ω*/T) Var_π( (∇µ_u(y)/µ_u(y)) J̃_β(j) ) + ‖ E[ (∇µ_u(y)/µ_u(y)) A_β(j) ] ‖²,

where the second expectation is over i ∼ π, y ∼ ν(i), u ∼ µ(y), and j ∼ P_i(u).

If we write J̃_β(j) = J_β(j) + v(j), then by selecting v = (v(1), ..., v(|S|))′ from the right null space of the K × |S| matrix G, where G = Σ_{i,y,u} π_i ν_y(i) ∇µ_u(y) P_i(u), (4) will produce an unbiased estimate of ∇_β η. An obvious example of such a v is a constant vector, (c, c, ..., c)′, c ∈ R.
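To illustrate the null-space condition just stated, the following sketch builds the matrix G for the tabular example of the earlier sketches and checks that a constant perturbation v = (c, ..., c) is annihilated by it. Computing the stationary distribution by an eigen-decomposition is an implementation choice for this sketch, not something prescribed by the paper.

```python
import numpy as np

def build_G(theta, P, nu, n, nU, nY):
    """G = sum_{i,y,u} pi_i nu_y(i) grad mu_u(y) P_i(u), a K x |S| matrix,
    for the tabular softmax policy of the earlier sketches."""
    # Transition matrix of the state process under the policy, p_ij(theta).
    P_theta = np.zeros((n, n))
    for i in range(n):
        for y in range(nY):
            probs = mu(y, theta)
            for u in range(nU):
                P_theta[i] += nu[i, y] * probs[u] * P[u, i]
    # Stationary distribution pi: left eigenvector of P_theta for eigenvalue 1.
    evals, evecs = np.linalg.eig(P_theta.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()
    K = theta.size
    G = np.zeros((K, n))
    for i in range(n):
        for y in range(nY):
            for u in range(nU):
                # grad mu_u(y) = mu_u(y) * grad log mu_u(y), flattened to a K-vector.
                grad_mu_u = mu(y, theta)[u] * grad_log_mu(y, u, theta).ravel()
                G += pi[i] * nu[i, y] * np.outer(grad_mu_u, P[u, i])
    return G

G = build_G(theta, P, nu, n, nU, nY)
# A constant shift of the value function leaves the estimate (4) unbiased:
assert np.allclose(G @ np.ones(n), 0.0, atol=1e-10)
```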
We can use this null-space observation to construct a trivial example where (4) produces an unbiased estimate with zero variance. Indeed, let D = (S, U, Y, P, ν, r, µ) be a controlled POMDP satisfying Assumption 1, with r(i) = c for some 0 < c ≤ R. Then J_β(j) = c/(1 − β) and ∇_β η = 0. If we choose v = (−c/(1 − β), ..., −c/(1 − β))′ and J̃_β(j) = J_β(j) + v(j), then (∇µ_u(y)/µ_u(y)) J̃_β(j) = 0 for all y, u, j, and so (4) gives an unbiased estimate of ∇_β η, with zero variance. Furthermore, for any D for which there exists a pair y, u such that µ_u(y) > 0 and ∇µ_u(y) ≠ 0, choosing J̃_β(j) = J_β(j) gives a variance greater than zero: there is a non-zero probability event, (X_t = i, Y_t = y, U_t = u, X_{t+1} = j), such that (∇µ_u(y)/µ_u(y)) J_β(j) ≠ 0.

4 Algorithms

Given a parameterized class of baseline functions {A_r(·, θ) : Y → R | θ ∈ R^L}, we can use Theorem 5 to bound the variance of our estimates. Computing the gradient of this bound with respect to the parameters θ of the baseline function allows a gradient optimization of the baseline. The GDORB algorithm produces an estimate ∆_S of these gradients from a sample path of length S. Under the assumption that the baseline function and its gradients are uniformly bounded, we can show that these estimates converge to the gradient of σ_r². We omit the details (see [8]).

GDORB Algorithm:
Given: controlled POMDP (S, U, Y, P, ν, r, µ), parameterized baseline A_r.
  set z_0 = 0 (z_0 ∈ R^L), ∆_0 = 0 (∆_0 ∈ R^L)
  for all {i_s, y_s, u_s, i_{s+1}, y_{s+1}} generated by the POMDP do
    z_{s+1} = β z_s + ‖∇µ_{u_s}(y_s)/µ_{u_s}(y_s)‖² ∇_θ A_r(y_s)
    ∆_{s+1} = ∆_s + (1/(s+1)) [ (A_r(y_s) − β A_r(y_{s+1}) − r(i_{s+1})) z_{s+1} − ∆_s ]
  end for

For a parameterized class of fixed value functions {J̃_β(·, θ) : S → R | θ ∈ R^L}, we can use Theorem 6 to bound the expected squared error of our estimates, and compute the gradient of this bound with respect to the parameters θ of the fixed value function. The GBTE algorithm produces an estimate ∆_S of these gradients from a sample path of length S. Under the assumption that the value function and its gradients are uniformly bounded, we can show that these estimates converge to the true gradient.

GBTE Algorithm:
Given: controlled POMDP (S, U, Y, P, ν, r, µ), parameterized fixed value function J̃_β.
  set z_0 = 0 (z_0 ∈ R^K), ∆A_0 = 0 (∆A_0 ∈ R^{1×L}), ∆B_0 = 0 (∆B_0 ∈ R^K), ∆C_0 = 0 (∆C_0 ∈ R^K) and ∆D_0 = 0 (∆D_0 ∈ R^{K×L})
  for all {i_s, y_s, u_s, i_{s+1}, i_{s+2}} generated by the POMDP do
    z_{s+1} = β z_s + ∇µ_{u_s}(y_s)/µ_{u_s}(y_s)
    ∆A_{s+1} = ∆A_s + (1/(s+1)) [ ‖∇µ_{u_s}(y_s)/µ_{u_s}(y_s)‖² J̃_β(i_{s+1}) ∇_θ J̃_β(i_{s+1}) − ∆A_s ]
    ∆B_{s+1} = ∆B_s + (1/(s+1)) [ (∇µ_{u_s}(y_s)/µ_{u_s}(y_s)) J̃_β(i_{s+1}) − ∆B_s ]
    ∆C_{s+1} = ∆C_s + (1/(s+1)) [ (r(i_{s+1}) + β J̃_β(i_{s+2}) − J̃_β(i_{s+1})) z_{s+1} − ∆C_s ]
    ∆D_{s+1} = ∆D_s + (1/(s+1)) [ (∇µ_{u_s}(y_s)/µ_{u_s}(y_s)) ∇_θ J̃_β(i_{s+1}) − ∆D_s ]
    ∆_{s+1} = (Ω*/T) ∆A_{s+1} − (Ω*/T) ∆B_{s+1}′ ∆D_{s+1} − ∆C_{s+1}′ ∆D_{s+1}
  end for

5 Experiments

Experimental results comparing these GPOMDP variants for a simple three-state MDP (described in [5]) are shown in Figure 1. The exact value function plots show how different choices of baseline and fixed value function compare when all algorithms have access to the exact value function J_β. Using the expected value function as a baseline was an improvement over GPOMDP. Using the optimum, or optimum constant, baseline was a further improvement, each performing comparably to the other. Using the pre-trained fixed value function was also an improvement over GPOMDP, showing that selecting the true value function was indeed not the best choice in this case.
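Before turning back to the experiments, here is an online version of the GDORB listing above, written for the tabular policy of the earlier sketches. The update equations follow the listing as reconstructed above, so details may differ from the estimator analysed in [8]; the parameter vector w and the helper functions A_r and grad_A_r are placeholders for a particular baseline parameterization.

```python
import numpy as np

def gdorb_gradient(path, theta, w, A_r, grad_A_r, r, beta):
    """Running estimate Delta of the gradient, in the baseline parameters w,
    of the baseline variance bound, following the GDORB listing above.

    path: iterable of transitions (i_s, y_s, u_s, i_{s+1}, y_{s+1}).
    A_r(y, w): parameterized baseline; grad_A_r(y, w): its gradient in w.
    """
    z = np.zeros_like(w)
    delta = np.zeros_like(w)
    for s, (i, y, u, i_next, y_next) in enumerate(path):
        weight = np.sum(grad_log_mu(y, u, theta) ** 2)   # ||grad mu / mu||^2
        z = beta * z + weight * grad_A_r(y, w)
        td_like = A_r(y, w) - beta * A_r(y_next, w) - r[i_next]
        delta += (td_like * z - delta) / (s + 1)
    return delta
```

In the online experiments below, a step w ← w − stepsize · delta of this kind is interleaved with the gradient estimation itself, using the same data.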
Returning to the experiments: the trained fixed value function was not optimal, though, since J_β(j) − A_r* is itself a valid choice of fixed value function. The optimum baseline, and fixed value function, will not normally be known. The online plots show experiments where the baseline and fixed value function were trained using online gradient descent whilst the performance gradient was being estimated, using the same data. Clear improvement over GPOMDP is seen for the online trained baseline variant. For the online trained fixed value function, improvement is seen until T becomes very large, given the simplicity of the system.

[Figure 1: four panels of plots (exact value function: mean error and one standard deviation; online: mean error and one standard deviation), each showing Relative Norm Difference against T from 1 to 1e+07.]

Figure 1: Three-state experiments, relative norm error ‖∆_est − ∇η‖/‖∇η‖. Exact value function plots compare mean error and standard deviations for gradient estimates (with knowledge of J_β) computed by: GPOMDP [GPOMDP-Jβ]; with baseline A_r = E[J_β] [BL-E[Jβ]]; with optimum baseline [BL-A_r*(y)]; with optimum constant baseline [BL-A_r*]; with pre-trained fixed value function [FVF-pretrain]. Online plots do a similar comparison of estimates computed by: GPOMDP [GPOMDP]; with online trained baseline [BL-online]; with online trained fixed value function [FVF-online]. The plots were computed over 500 runs (1000 for FVF-online), with β = 0.95. Ω*/T was set to 0.001 for FVF-pretrain, and 0.01 for FVF-online.

References

[1] L. Baird and A. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11, pages 968–974. MIT Press, 1999.
[2] P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and System Sciences, 2002. To appear.
[3] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834–846, 1983.
[4] J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[5] J. Baxter, P. L. Bartlett, and L. Weaver. Infinite-horizon gradient-based policy search: II. Gradient ascent algorithms and experiments. Journal of Artificial Intelligence Research, 15:351–381, 2001.
[6] M. Evans and T. Swartz. Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford University Press, 2000.
[7] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33:75–84, 1990.
[8] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Technical report, ANU, 2002.
[9] H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In D. H. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pages 152–160, 1997.
[10] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems 12, pages 1008–1014. MIT Press, 2000.
[11] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. Technical report, MIT, 1998.
[12] R. Y. Rubinstein. How to optimize complex stochastic systems from a single sample path by the score function method. Annals of Operations Research, 27:175–211, 1991.
[13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. ISBN 0-262-19398-1.
[14] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
[15] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
3 0.65619892 126 nips-2001-Motivated Reinforcement Learning
Author: Peter Dayan
Abstract: The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of summed future reward. Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful experiments into the neurobiology and psychology of motivation, reviewed in Dickinson & Balleine [12, 13], show that this view is incomplete. In many cases, animals are faced with the choice not between many different actions at a given state, but rather whether a single response is worth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.
4 0.58271027 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
Author: Bram Bakker
Abstract: This paper presents reinforcement learning with a Long Short-Term Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(λ) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task.
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state-action value function. Theory is presented showing that linear function approximation representations of the state-action value function can degrade the rate of convergence of performance gradient estimates, relative to when no function approximation is used, by a factor that depends on the number of possible actions and on the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state-action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by a factor that depends on the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
6 0.51112694 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
7 0.50719893 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
8 0.48635098 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
9 0.39321724 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
10 0.37169558 43 nips-2001-Bayesian time series classification
11 0.3700684 115 nips-2001-Linear-time inference in Hierarchical HMMs
12 0.3575303 128 nips-2001-Multiagent Planning with Factored MDPs
13 0.34884959 40 nips-2001-Batch Value Function Approximation via Support Vectors
14 0.34435427 183 nips-2001-The Infinite Hidden Markov Model
15 0.33403182 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
16 0.33324879 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
17 0.31849119 13 nips-2001-A Natural Policy Gradient
18 0.30847681 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
19 0.30211863 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation
20 0.29868957 177 nips-2001-Switch Packet Arbitration via Queue-Learning
topicId topicWeight
[(17, 0.024), (19, 0.022), (27, 0.074), (30, 0.039), (38, 0.017), (59, 0.013), (72, 0.037), (79, 0.033), (83, 0.021), (91, 0.626)]
simIndex simValue paperId paperTitle
1 0.99306232 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images
Author: James M. Coughlan, Alan L. Yuille
Abstract: We describe the g-factor, which relates probability distributions on image features to distributions on the images themselves. The g-factor depends only on our choice of features and lattice quantization and is independent of the training image data. We illustrate the importance of the g-factor by analyzing how the parameters of Markov Random Field (i.e. Gibbs or log-linear) probability models of images are learned from data by maximum likelihood estimation. In particular, we study homogeneous MRF models which learn image distributions in terms of clique potentials corresponding to feature histogram statistics (cf. Minimax Entropy Learning (MEL) by Zhu, Wu and Mumford 1997 [11]). We first use our analysis of the g-factor to determine when the clique potentials decouple for different features. Second, we show that clique potentials can be computed analytically by approximating the g-factor. Third, we demonstrate a connection between this approximation and the Generalized Iterative Scaling algorithm (GIS), due to Darroch and Ratcliff 1972 [2], for calculating potentials. This connection enables us to use GIS to improve our multinomial approximation, using Bethe-Kikuchi [8] approximations to simplify the GIS procedure. We support our analysis by computer simulations.
2 0.99224341 87 nips-2001-Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway
Author: Gal Chechik, Amir Globerson, M. J. Anderson, E. D. Young, Israel Nelken, Naftali Tishby
Abstract: The way groups of auditory neurons interact to code acoustic information is investigated using an information theoretic approach. We develop measures of redundancy among groups of neurons, and apply them to the study of collaborative coding efficiency in two processing stations in the auditory pathway: the inferior colliculus (IC) and the primary auditory cortex (AI). Under two schemes for the coding of the acoustic content, acoustic segments coding and stimulus identity coding, we show differences both in information content and group redundancies between IC and AI neurons. These results provide for the first time direct evidence for redundancy reduction along the ascending auditory pathway, as has been hypothesized on theoretical grounds [Barlow 1959, 2001]. The redundancy effects under the single-spikes coding scheme are significant only for groups larger than ten cells, and cannot be revealed with redundancy measures that use only pairs of cells. The results suggest that the auditory system transforms low-level representations, which contain redundancies due to the statistical structure of natural stimuli, into a representation in which cortical neurons extract rare and independent components of complex acoustic signals that are useful for auditory scene analysis.
3 0.98905158 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task
Author: Michael C. Mozer, Michael D. Colagrosso, David E. Huber
Abstract: We are interested in the mechanisms by which individuals monitor and adjust their performance of simple cognitive tasks. We model a speeded discrimination task in which individuals are asked to classify a sequence of stimuli (Jones & Braver, 2001). Response conflict arises when one stimulus class is infrequent relative to another, resulting in more errors and slower reaction times for the infrequent class. How do control processes modulate behavior based on the relative class frequencies? We explain performance from a rational perspective that casts the goal of individuals as minimizing a cost that depends both on error rate and reaction time. With two additional assumptions of rationality—that class prior probabilities are accurately estimated and that inference is optimal subject to limitations on rate of information transmission—we obtain a good fit to overall RT and error data, as well as trial-by-trial variations in performance. Consider the following scenario: While driving, you approach an intersection at which the traffic light has already turned yellow, signaling that it is about to turn red. You also notice that a car is approaching you rapidly from behind, with no indication of slowing. Should you stop or speed through the intersection? The decision is difficult due to the presence of two conflicting signals. Such response conflict can be produced in a psychological laboratory as well. For example, Stroop (1935) asked individuals to name the color of ink on which a word is printed. When the words are color names incongruous with the ink color— e.g., “blue” printed in red—reaction times are slower and error rates are higher. We are interested in the control mechanisms underlying performance of high-conflict tasks. Conflict requires individuals to monitor and adjust their behavior, possibly responding more slowly if errors are too frequent. In this paper, we model a speeded discrimination paradigm in which individuals are asked to classify a sequence of stimuli (Jones & Braver, 2001). The stimuli are letters of the alphabet, A–Z, presented in rapid succession. In a choice task, individuals are asked to press one response key if the letter is an X or another response key for any letter other than X (as a shorthand, we will refer to non-X stimuli as Y). In a go/no-go task, individuals are asked to press a response key when X is presented and to make no response otherwise. We address both tasks because they elicit slightly different decision-making behavior. In both tasks, Jones and Braver (2001) manipulated the relative frequency of the X and Y stimuli; the ratio of presentation frequency was either 17:83, 50:50, or 83:17. Response conflict arises when the two stimulus classes are unbalanced in frequency, resulting in more errors and slower reaction times. For example, when X’s are frequent but Y is presented, individuals are predisposed toward producing the X response, and this predisposition must be overcome by the perceptual evidence from the Y. Jones and Braver (2001) also performed an fMRI study of this task and found that anterior cingulate cortex (ACC) becomes activated in situations involving response conflict. Specifically, when one stimulus occurs infrequently relative to the other, event-related fMRI response in the ACC is greater for the low frequency stimulus. Jones and Braver also extended a neural network model of Botvinick, Braver, Barch, Carter, and Cohen (2001) to account for human performance in the two discrimination tasks. 
The heart of the model is a mechanism that monitors conflict (the posited role of the ACC) and adjusts response biases accordingly. In this paper, we develop a parsimonious alternative account of the role of the ACC and of how control processes modulate behavior when response conflict arises.

1 A RATIONAL ANALYSIS

Our account is based on a rational analysis of human cognition, which views cognitive processes as being optimized with respect to certain task-related goals, and being adaptive to the structure of the environment (Anderson, 1990). We make three assumptions of rationality: (1) perceptual inference is optimal but is subject to rate limitations on information transmission, (2) response class prior probabilities are accurately estimated, and (3) the goal of individuals is to minimize a cost that depends both on error rate and reaction time. The heart of our account is an existing probabilistic model that explains a variety of facilitation effects that arise from long-term repetition priming (Colagrosso, in preparation; Mozer, Colagrosso, & Huber, 2000), and more broadly, that addresses changes in the nature of information transmission in neocortex due to experience. We give a brief overview of this model; the details are not essential for the present work. The model posits that neocortex can be characterized by a collection of information-processing pathways, and any act of cognition involves coordination among pathways. To model a simple discrimination task, we might suppose a perceptual pathway to map the visual input to a semantic representation, and a response pathway to map the semantic representation to a response. The choice and go/no-go tasks described earlier share a perceptual pathway, but require different response pathways. The model is framed in terms of probability theory: pathway inputs and outputs are random variables and microinference in a pathway is carried out by Bayesian belief revision. To elaborate, consider a pathway whose input at time t is a discrete random variable that can assume one of several values corresponding to alternative input states. Similarly, the output of the pathway at time t is a discrete random variable with its own set of alternative values. For example, the input to the perceptual pathway in the discrimination task is one of 26 visual patterns corresponding to the letters of the alphabet, and the output is one of 26 letter identities. (This model is highly abstract: the visual patterns are enumerated, but the actual pixel patterns are not explicitly represented in the model. Nonetheless, the similarity structure among inputs can be captured, but we skip a discussion of this issue because it is irrelevant for the current work.) To present a particular input alternative to the model for some number of time steps, we clamp the input variable to that alternative for those steps. The model then computes a probability distribution over the output given the clamped input.
same-paper 4 0.98848814 148 nips-2001-Predictive Representations of State
Author: Michael L. Littman, Richard S. Sutton
Abstract: We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model. In predicting or controlling a sequence of observations, the concepts of state and state estimation inevitably arise. There have been two dominant approaches. The generative-model approach, typified by research on partially observable Markov decision processes (POMDPs), hypothesizes a structure for generating observations and estimates its state and state dynamics. The history-based approach, typified by k-order Markov methods, uses simple functions of past observations as state, that is, as the immediate basis for prediction and control. (The data flow in these two approaches are diagrammed in Figure 1.) Of the two, the generative-model approach is more general. The model's internal state gives it temporally unlimited memorythe ability to remember an event that happened arbitrarily long ago--whereas a history-based approach can only remember as far back as its history extends. The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. Most uses of POMDPs, for example, assume a perfect dynamics model and attempt only to estimate state. There are algorithms for simultaneously estimating state and dynamics (e.g., Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al., 1970), but these are only effective at tuning parameters that are already approximately correct (e.g., Shatkay & Kaelbling, 1997). observations (and actions) 1-----1-----1..- (a) state rep'n observations (and actions) ¢E / t/' --+ 1-step delays . state rep'n (b) Figure 1: Data flow in a) POMDP and other recursive updating of state representation, and b) history-based state representation. In practice, history-based approaches are often much more effective. Here, the state representation is a relatively simple record of the stream of past actions and observations. It might record the occurrence of a specific subsequence or that one event has occurred more recently than another. Such representations are far more closely linked to the data than are POMDP representations. One way of saying this is that POMDP learning algorithms encounter many local minima and saddle points because all their states are equipotential. History-based systems immediately break symmetry, and their direct learning procedure makes them comparably simple. McCallum (1995) has shown in a number of examples that sophisticated history-based methods can be effective in large problems, and are often more practical than POMDP methods even in small ones. 
The predictive state representation (PSR) approach, which we develop in this paper, is like the generative-model approach in that it updates the state representation recursively, as in Figure l(a), rather than directly computing it from data. We show that this enables it to attain generality and compactness at least equal to that of the generative-model approach. However, the PSR approach is also like the history-based approach in that its representations are grounded in data. Whereas a history-based representation looks to the past and records what did happen, a PSR looks to the future and represents what will happen. In particular, a PSR is a vector of predictions for a specially selected set of action-observation sequences, called tests (after Rivest & Schapire, 1994). For example, consider the test U101U202, where U1 and U2 are specific actions and 01 and 02 are specific observations. The correct prediction for this test given the data stream up to time k is the probability of its observations occurring (in order) given that its actions are taken (in order) (i.e., Pr {Ok = 01, Ok+1 = 02 I A k = u1,A k + 1 = U2}). Each test is a kind of experiment that could be performed to tell us something about the system. If we knew the outcome of all possible tests, then we would know everything there is to know about the system. A PSR is a set of tests that is sufficient information to determine the prediction for all possible tests (a sufficient statistic). As an example of these points, consider the float/reset problem (Figure 2) consisting of a linear string of 5 states with a distinguished reset state on the far right. One action, f (float), causes the system to move uniformly at random to the right or left by one state, bounded at the two ends. The other action, r (reset), causes a jump to the reset state irrespective of the current state. The observation is always o unless the r action is taken when the system is already in the reset state, in which case the observation is 1. Thus, on an f action, the correct prediction is always 0, whereas on an r action, the correct prediction depends on how many fs there have been since the last r: for zero fS, it is 1; for one or two fS, it is 0.5; for three or four fS, it is 0.375; for five or six fs, it is 0.3125, and so on decreasing after every second f, asymptotically bottoming out at 0.2. No k-order Markov method can model this system exactly, because no limited-. .5 .5 a) float action 1,0=1 b) reset action Figure 2: Underlying dynamics of the float/reset problem for a) the float action and b) the reset action. The numbers on the arcs indicate transition probabilities. The observation is always 0 except on the reset action from the rightmost state, which produces an observation of 1. length history is a sufficient statistic. A POMDP approach can model it exactly by maintaining a belief-state representation over five or so states. A PSR, on the other hand, can exactly model the float/reset system using just two tests: rl and fOrI. Starting from the rightmost state, the correct predictions for these two tests are always two successive probabilities in the sequence given above (1, 0.5, 0.5, 0.375,...), which is always a sufficient statistic to predict the next pair in the sequence. Although this informational analysis indicates a solution is possible in principle, it would require a nonlinear updating process for the PSR. 
In this paper we restrict consideration to a linear special case of PSRs, for which we can guarantee that the number of tests needed does not exceed the number of states in the minimal POMDP representation (although we have not ruled out the possibility it can be considerably smaller). Of greater ultimate interest are the prospects for learning PSRs and their update functions, about which we can only speculate at this time. The difficulty of learning POMDP structures without good prior models are well known. To the extent that this difficulty is due to the indirect link between the POMDP states and the data, predictive representations may be able to do better. Jaeger (2000) introduced the idea of predictive representations as an alternative to belief states in hidden Markov models and provided a learning procedure for these models. We build on his work by treating the control case (with actions), which he did not significantly analyze. We have also been strongly influenced by the work of Rivest and Schapire (1994), who did consider tests including actions, but treated only the deterministic case, which is significantly different. They also explored construction and learning algorithms for discovering system structure. 1 Predictive State Representations We consider dynamical systems that accept actions from a discrete set A and generate observations from a discrete set O. We consider only predicting the system, not controlling it, so we do not designate an explicit reward observation. We refer to such a system as an environment. We use the term history to denote a test forming an initial stream of experience and characterize an environment by a probability distribution over all possible histories, P : {OIA}* H- [0,1], where P(Ol··· Otl a1··· at) is the probability of observations 01, ... , O£ being generated, in that order, given that actions aI, ... ,at are taken, in that order. The probability of a test t conditional on a history h is defined as P(tlh) = P(ht)/P(h). Given a set of q tests Q = {til, we define their (1 x q) prediction vector, p(h) = [P(t1Ih),P(t2Ih), ... ,P(tqlh)], as a predictive state representation (PSR) if and only if it forms a sufficient statistic for the environment, Le., if and only if P(tlh) = ft(P(h)), (1) for any test t and history h, and for some projection junction ft : [0, l]q ~ [0,1]. In this paper we focus on linear PSRs, for which the projection functions are linear, that is, for which there exist a (1 x q) projection vector mt, for every test t, such that (2) P(tlh) == ft(P(h)) =7 p(h)mf, for all histories h. Let Pi(h) denote the ith component of the prediction vector for some PSR. This can be updated recursively, given a new action-observation pair a,o, by .(h ) == P(t.lh ) == P(otil ha ) == faati(P(h)) == p(h)m'{;ati P2 ao 2 ao P(olha) faa (P(h)) p(h)mro ' (3) where the last step is specific to linear PSRs. We can now state our main result: Theorem 1 For any environment that can be represented by a finite POMDP model, there exists a linear PSR with number of tests no larger than the number of states in the minimal POMDP model. 2 Proof of Theorem 1: Constructing a PSR from a POMDP We prove Theorem 1 by showing that for any POMDP model of the environment, we can construct in polynomial time a linear PSR for that POMDP of lesser or equal complexity that produces the same probability distribution over histories as the POMDP model. We proceed in three steps. First, we review POMDP models and how they assign probabilities to tests. 
Next, we define an algorithm that takes an n-state POMDP model and produces a set of n or fewer tests, each of length less than or equal to n. Finally, we show that the set of tests constitute a PSR for the POMDP, that is, that there are projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. A POMDP (Lovejoy, 1991; Kaelbling et al., 1998) is defined by a sextuple (8, A, 0, bo, T, 0). Here, 8 is a set of n underlying (hidden) states, A is a discrete set of actions, and 0 is a discrete set of observations. The (1 x n) vector bo is an initial state distribution. The set T consists of (n x n) transition matrices Ta, one for each action a, where Tlj is the probability of a transition from state i to j when action a is chosen. The set 0 consists of diagonal (n x n) observation matrices oa,o, one for each pair of observation 0 and action a, where o~'o is the probability of observation 0 when action a is selected and state i is reached. l The state representation in a POMDP (Figure l(a)) is the belief state-the (1 x n) vector of the state-occupation probabilities given the history h. It can be computed recursively given a new action a and observation 0 by b(h)Taoa,o b(hao) = b(h)Taoa,oe;' where en is the (1 x n)-vector of all Is. Finally, a POMDP defines a probability distribution over tests (and thus histories) by P(Ol ... otlhal ... at) == b(h)Ta1oal,Ol ... Taloa£,Ole~. (4) IThere are many equivalent formulations and the conversion procedure described here can be easily modified to accommodate other POMDP definitions. We now present our algorithm for constructing a PSR for a given POMDP. It uses a function u mapping tests to (1 x n) vectors defined recursively by u(c) == en and u(aot) == (Taoa,ou(t)T)T, where c represents the null test. Conceptually, the components of u(t) are the probabilities of the test t when applied from each underlying state of the POMDP; we call u(t) the outcome vector for test t. We say a test t is linearly independent of a set of tests S if its outcome vector is linearly independent of the set of outcome vectors of the tests in S. Our algorithm search is used and defined as Q -<- search(c, {}) search(t, S): for each a E A, 0 E 0 if aot is linearly independent of S then S -<- search(aot, S U {aot}) return S The algorithm maintains a set of tests and searches for new tests that are linearly independent of those already found. It is a form of depth-first search. The algorithm halts when it checks all the one-step extensions of its tests and finds none that are linearly independent. Because the set of tests Q returned by search have linearly independent outcome vectors, the cardinality of Q is bounded by n, ensuring that the algorithm halts after a polynomial number of iterations. Because each test in Q is formed by a one-step extension to some other test in Q, no test is longer than n action-observation pairs. The check for linear independence can be performed in many ways, including Gaussian elimination, implying that search terminates in polynomial time. By construction, all one-step extensions to the set of tests Q returned by search are linearly dependent on those in Q. We now show that this is true for any test. Lemma 1 The outcome vectors of the tests in Q can be linearly combined to produce the outcome vector for any test. Proof: Let U be the (n x q) matrix formed by concatenating the outcome vectors for all tests in Q. 
Since, for all combinations of a and 0, the columns of Taoa,ou are linearly dependent on the columns of U, we can write Taoa,ou == UW T for some q x q matrix of weights W. If t is a test that is linearly dependent on Q, then anyone-step extension of t, aot, is linearly dependent on Q. This is because we can write the outcome vector for t as u(t) == (UwT)T for some (1 x q) weight vector w and the outcome vector for aot as u(aot) == (Taoa,ou(t)T)T == (Taoa,oUwT)T == (UWTwT)T. Thus, aot is linearly dependent on Q. Now, note that all one-step tests are linearly dependent on Q by the structure of the search algorithm. Using the previous paragraph as an inductive argument, this implies that all tests are linearly dependent on Q. 0 Returning to the float/reset example POMDP, search begins with by enumerating the 4 extensions to the null test (fO, fl, rO, and rl). Of these, only fa and rO are are linearly independent. Of the extensions of these, fOrO is the only one that is linearly independent of the other two. The remaining two tests added to Q by search are fOfOrO and fOfOfOrO. No extensions of the 5 tests in Q are linearly independent of the 5 tests in Q, so the procedure halts. We now show that the set of tests Q constitute a PSR for the POMDP by constructing projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. For each combination of a and 0, define a q x q matrix Mao == (U+Taoa,ou)T and a 1 x q vector mao == (U+Taoa,oe;;J T , where U is the matrix of outcome vectors defined in the previous section and U+ is its pseudoinverse2 • The ith row of Mao is maoti. The probability distribution on histories implied by these projection vectors is p(h )m~101 alOl p(h)M~ol M~_10l_1 m~Ol b(h)UU+r a1 oa 1,01 U ... U+T al-10 al-1,Ol-1 UU+Taloal,ol b(h)T a1 0 a1,01 ... ral-l0al-t,ol-lTaloal,Ole~, Le., it is the same as that of the POMDP, as in Equation 4. Here, the last step uses the fact that UU+v T == v T for v T linearly dependent on the columns of U. This holds by construction of U in the previous section. This completes the proof of Theorem 1. Completing the float/reset example, consider the Mf,o matrix found by the process defined in this section. It derives predictions for each test in Q after taking action f. Most of these are quite simple because the tests are so similar: the new prediction for rO is exactly the old prediction for fOrO, for example. The only non trivial test is fOfOfOrO. Its outcome can be computed from 0.250 p(rOlh) - 0.0625 p(fOrOlh) + 0.750 p(fOfOrOlh). This example illustrates that the projection vectors need not contain only positive entries. 3 Conclusion We have introduced a predictive state representation for dynamical systems that is grounded in actions and observations and shown that, even in its linear form, it is at least as general and compact as POMDPs. In essence, we have established PSRs as a non-inferior alternative to POMDPs, and suggested that they might have important advantages, while leaving demonstration of those advantages to future work. We conclude by summarizing the potential advantages (to be explored in future work): Learnability. The k-order Markov model is similar to PSRs in that it is entirely based on actions and observations. Such models can be learned trivially from data by counting-it is an open question whether something similar can be done with a PSR. 
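As an illustration of the construction just described, the following sketch computes outcome vectors u(t) and runs the depth-first search for a set of tests with linearly independent outcome vectors. The encoding of the POMDP as dictionaries of transition matrices T[a] and diagonal observation matrices O[(a, o)] is an assumed interface for this sketch, not something specified in the paper.

```python
import numpy as np
from itertools import product

def outcome_vector(test, T, O, n):
    """u(t) for a test t = [(a1, o1), ..., (ak, ok)]: component i is the
    probability of the test's observations when executed from state i."""
    u = np.ones(n)
    for a, o in reversed(test):
        u = T[a] @ (O[(a, o)] @ u)
    return u

def search_tests(T, O, n, actions, observations):
    """Depth-first search for tests with linearly independent outcome vectors,
    following the `search` procedure described above."""
    tests, vectors = [], []

    def is_independent(v):
        if not vectors:
            return not np.allclose(v, 0.0)
        stacked = np.vstack(vectors)
        return (np.linalg.matrix_rank(np.vstack([stacked, v]))
                > np.linalg.matrix_rank(stacked))

    def search(t):
        for a, o in product(actions, observations):
            extension = [(a, o)] + t     # the one-step extension "a o t"
            v = outcome_vector(extension, T, O, n)
            if is_independent(v):
                tests.append(extension)
                vectors.append(v)
                search(extension)

    search([])
    return tests
```

The projection vectors of the proof can then be formed from the resulting outcome matrix U via its pseudoinverse (np.linalg.pinv), mirroring the M_ao and m_ao construction above.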
Jaeger (2000) showed how to learn such a model in the uncontrolled setting, but the situation is more complex in the multiple action case since outcomes are conditioned on behavior, violating some required independence assumptions. Compactness. We have shown that there exist linear PSRs no more complex that the minimal POMDP for an environment, but in some cases the minimal linear PSR seems to be much smaller. For example, a POMDP extension of factored MDPs explored by Singh and Cohn (1998) would be cross-products of separate POMDPs and have linear PSRs that increase linearly with the number and size of the component POMDPs, whereas their minimal POMDP representation would grow as the size 2If U = A~BT is the singular value decomposition of U, then B:E+ AT is the pseudoinverse. The pseudoinverse of the diagonal matrix }J replaces each non-zero element with its reciprocal. e; of the state space, Le., exponential in the number of component POMDPs. This (apparent) advantage stems from the PSR's combinatorial or factored structure. As a vector of state variables, capable of taking on diverse values, a PSR may be inherently more powerful than the distribution over discrete states (the belief state) of a POMDP. We have already seen that general PSRs can be more compact than POMDPs; they are also capable of efficiently capturing environments in the diversity representation used by Rivest and Schapire (1994), which is known to provide an extremely compact representation for some environments. Generalization. There are reasons to think that state variables that are themselves predictions may be particularly useful in learning to make other predictions. With so many things to predict, we have in effect a set or sequence of learning problems, all due to the same environment. In many such cases the solutions to earlier problems have been shown to provide features that generalize particularly well to subsequent problems (e.g., Baxter, 2000; Thrun & Pratt, 1998). Powerful, extensible representations. PSRs that predict tests could be generalized to predict the outcomes of multi-step options (e.g., Sutton et al., 1999). In this case, particularly, they would constitute a powerful language for representing the state of complex environments. AcknowledgIllents: We thank Peter Dayan, Lawrence Saul, Fernando Pereira and Rob Schapire for many helpful discussions of these and related ideas. References Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164-171. Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149-198. Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 183-188). San Jose, California: AAAI Press. Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12, 1371-1398. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in ' partially observable stochastic domains. Artificial Intelligence, 101, 99-134. Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47-65. McCallum, A. K. (1995). Reinforcement learning with selective perception and hidden state. 
Doctoral dissertation, Department of Computer Science, University of Rochester. Rivest, R. L., & Schapire, R. E. (1994). Diversity-based inference of finite automata. Journal of the ACM, 41, 555-589. Shatkay, H., & Kaelbling, L. P. (1997). Learning topological maps with weak local odometric information. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97) (pp. 920-929). Singh, S., & Cohn, D. (1998). How to dynamically merge Markov decision processes. Advances in Neural Information Processing Systems 10 (pp. 1057-1063). Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181-211. Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.
5 0.98021168 107 nips-2001-Latent Dirichlet Allocation
Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.
6 0.96749693 111 nips-2001-Learning Lateral Interactions for Feature Binding and Sensory Segmentation
7 0.94883794 144 nips-2001-Partially labeled classification with Markov random walks
8 0.92488158 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms
9 0.89418113 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds
10 0.86183596 30 nips-2001-Agglomerative Multivariate Information Bottleneck
11 0.86092913 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
12 0.8387773 24 nips-2001-Active Information Retrieval
13 0.8238343 183 nips-2001-The Infinite Hidden Markov Model
14 0.81889367 68 nips-2001-Entropy and Inference, Revisited
15 0.81802142 96 nips-2001-Information-Geometric Decomposition in Spike Analysis
16 0.81593072 11 nips-2001-A Maximum-Likelihood Approach to Modeling Multisensory Enhancement
17 0.81293225 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
18 0.81150085 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
19 0.80901313 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
20 0.80194157 51 nips-2001-Cobot: A Social Reinforcement Learning Agent