nips nips2004 nips2004-159 knowledge-graph by maker-knowledge-mining

159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models


Source: pdf

Author: Michael P. Holmes, Charles L. Isbell Jr.

Abstract: Schema learning is a way to discover probabilistic, constructivist, predictive action models (schemas) from experience. It includes methods for finding and using hidden state to make predictions more accurate. We extend the original schema mechanism [1] to handle arbitrary discrete-valued sensors, improve the original learning criteria to handle POMDP domains, and better maintain hidden state by using schema predictions. These extensions show large improvement over the original schema mechanism in several rewardless POMDPs, and achieve very low prediction error in a difficult speech modeling task. Further, we compare extended schema learning to the recently introduced predictive state representations [2], and find their predictions of next-step action effects to be approximately equal in accuracy. This work lays the foundation for a schema-based system of integrated learning and planning. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Schema learning is a way to discover probabilistic, constructivist, predictive action models (schemas) from experience. [sent-7, score-0.172]

2 It includes methods for finding and using hidden state to make predictions more accurate. [sent-8, score-0.165]

3 We extend the original schema mechanism [1] to handle arbitrary discrete-valued sensors, improve the original learning criteria to handle POMDP domains, and better maintain hidden state by using schema predictions. [sent-9, score-1.838]

4 These extensions show large improvement over the original schema mechanism in several rewardless POMDPs, and achieve very low prediction error in a difficult speech modeling task. [sent-10, score-0.909]

5 Further, we compare extended schema learning to the recently introduced predictive state representations [2], and find their predictions of next-step action effects to be approximately equal in accuracy. [sent-11, score-1.167]

6 Introduction: Schema learning is a data-driven, constructivist approach for discovering probabilistic action models in dynamic controlled systems. [sent-13, score-0.154]

7 A schema predicts how specific sensor values will change as different actions are executed from within particular sensory contexts. [sent-15, score-0.898]

8 The learning mechanism also discovers hidden state features in order to make schema predictions more accurate. [sent-16, score-0.965]

9 In this work we have generalized and extended Drescher’s original mechanism to learn more accurate predictions by using improved criteria both for discovery and refinement of schemas as well as for creation and maintenance of hidden state. [sent-17, score-0.797]

10 While Drescher’s work included mechanisms for action selection, here we focus exclusively on the problem of learning schemas and hidden state to accurately model the world. [sent-18, score-0.623]

11 In several benchmark POMDPs, we show that our extended schema learner produces significantly better action models than the original. [sent-19, score-1.087]

12 We also show that the extended learner performs well on a complex, noisy speech modeling task, and that its prediction accuracy is approximately equal to that of predictive state representations [2] on a set of POMDPs, with faster convergence. [sent-20, score-0.44]

13 This use of the term schema derives from Piaget’s usage in the 1950s; it bears no relation to database schemas or other uses of the term. [sent-21, score-1.128]

14 Schema Learning: Schema learning is a process of constructing probabilistic action models of the environment so that the effects of agent actions can be predicted. [sent-22, score-0.25]

15 Formally, a schema learner is fitted with a set of sensors S = {s1, s2, ...}. [sent-23, score-0.929]

16 As it observes the effects of its actions on the environment, the learner constructs predictive units of sensorimotor cause and effect called schemas. [sent-31, score-0.289]

17 A schema C −a_i→ R essentially says, “If I take action a_i in situation C, I will see result R.” [sent-32, score-0.901]

18 A schema has three parts: (1) the context C = {c1, ..., cn}, which is a set of sensor conditions c_i (assertions about sensor values) that must hold for the schema to be applicable, (2) the action a_j that is taken, and (3) the result, which is a set of sensor conditions R = {r1, r2, ...}. [sent-36, score-1.12]

19 A schema is said to be applicable if its context conditions are satisfied, activated if it is applicable and its action is taken, and to succeed if it is activated and the predicted result is observed. [sent-40, score-1.119]
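
To make the schema anatomy concrete, here is a minimal Python sketch (not the paper's code) of a schema as a context set, an action, and a result set, together with the applicable / activated / succeed distinctions above. Representing a sensor snapshot as a dict of name→value pairs, and the door/room example at the end, are illustrative assumptions.

```python
# Minimal sketch of a schema: context and result are sets of sensor
# conditions, stored here as {sensor_name: required_value}.

class Schema:
    def __init__(self, context, action, result):
        self.context = dict(context)   # conditions that must hold beforehand
        self.action = action           # the action the schema is about
        self.result = dict(result)     # predicted sensor values afterwards
        self.activations = 0           # times applicable and action taken
        self.successes = 0             # activations whose result was observed

    def applicable(self, readings):
        """Context conditions are satisfied by the current sensor readings."""
        return all(readings.get(s) == v for s, v in self.context.items())

    def activated(self, readings, action):
        """Applicable, and its action is the one actually taken."""
        return self.applicable(readings) and action == self.action

    def succeeded(self, next_readings):
        """The predicted result holds on the following timestep."""
        return all(next_readings.get(s) == v for s, v in self.result.items())

    def reliability(self):
        """Empirical success rate, Rel(C -a-> R)."""
        return self.successes / self.activations if self.activations else 0.0


# Hypothetical usage: "if door_open=1 and I take 'go', then room becomes 2".
s = Schema(context={"door_open": 1}, action="go", result={"room": 2})
now = {"door_open": 1, "room": 1}
if s.activated(now, "go"):
    s.activations += 1
    s.successes += int(s.succeeded({"door_open": 1, "room": 2}))
print(s.reliability())   # 1.0 after this single successful activation
```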

20 Note that schemas are not rules telling an agent what to do; rather, they are descriptions of what will happen if the agent takes a particular action in a specific circumstance. [sent-42, score-0.52]

21 Also note that schema learning has no predefined states such as those found in a POMDP or HMM; the set of sensor readings is the state. [sent-43, score-0.866]

22 Because one schema’s result can set up another schema’s context, schemas fit naturally into a planning paradigm in which they are chained from the current situation to reach sensor-defined goals. [sent-44, score-0.398]
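
As a toy illustration of the chaining idea (the paper itself leaves planning to future work), a naive backward search over schemas whose results establish another schema's context might look like the sketch below, assuming the Schema class from the earlier sketch. The search strategy and depth bound are assumptions, not the paper's algorithm.

```python
def chain_to_goal(goal_conditions, schemas, max_depth=5):
    """Backward-chain from sensor-defined goal conditions to a schema sequence.
    Purely illustrative; returns a list of schemas to activate in order."""
    def search(needed, depth):
        if not needed:
            return []
        if depth >= max_depth:
            return None
        for schema in schemas:
            # Conditions this schema's result would establish.
            achieved = {s: v for s, v in needed.items()
                        if schema.result.get(s) == v}
            if not achieved:
                continue
            remaining = {s: v for s, v in needed.items() if s not in achieved}
            # Recurse on what is still needed plus this schema's own context.
            plan = search({**remaining, **schema.context}, depth + 1)
            if plan is not None:
                return plan + [schema]
        return None
    return search(dict(goal_conditions), 0)
```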

23 Discovery and Refinement: Schema learning comprises two basic phases: discovery, in which context-free action/result schemas are found, and refinement, in which context is added to increase reliability. [sent-46, score-0.403]

24 In discovery, statistics track the influence of each action a_i on each sensor condition s_j. [sent-47, score-0.306]

25 Drescher’s original schema mechanism accommodated only binary-valued sensors, but we have generalized it to allow a heterogeneous set of sensors that take on arbitrary discrete values. [sent-48, score-0.867]

26 In the present work, we assume that the effects of actions are observed on the subsequent timestep, which leads to the following criterion for discovering action effects: count(a_i at time t, s_j = r at time t+1) > θ_d, (1) where θ_d is a noise-filtering threshold. [sent-49, score-0.302]

27 If this criterion is met, the learner constructs a schema ∅ −a_i→ (s_j = r), where the empty set, ∅, means that the schema is applicable in any situation. [sent-50, score-1.756]
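
Read literally, the discovery criterion is a thresholded co-occurrence count, which a sketch like the following captures (reusing the Schema sketch above; the experience tuple layout and the value of θ_d are assumptions):

```python
from collections import Counter

def discover(experience, theta_d=5):
    """experience: iterable of (readings_t, action_t, readings_t_plus_1).
    Counts co-occurrences of taking action a_i at time t with sensor s_j
    showing value r at time t+1, then proposes a context-free schema
    {} -a_i-> (s_j = r) once the count clears the threshold theta_d."""
    counts = Counter()
    for _readings, action, nxt in experience:
        for sensor, value in nxt.items():
            counts[(action, sensor, value)] += 1

    schemas = []
    for (action, sensor, value), n in counts.items():
        if n > theta_d:
            # Empty context: applicable in any situation.
            schemas.append(Schema(context={}, action=action,
                                  result={sensor: value}))
    return schemas
```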

28 This works in a POMDP because it means that executing a_i in some state has caused sensor s_j to give observation r, implying that such a transition exists in the underlying (but unknown) system model. [sent-51, score-0.223]

29 Experiments in worlds of known structure show that this criterion misses many true action effects. [sent-54, score-0.15]

30 When a schema is discovered, it has no context, so its reliability may be low if the effect occurs only in particular situations. [sent-55, score-0.805]

31 Once the criterion is met, a child schema C′ −a_i→ R is formed, where C′ = C ∪ {s_j}. [sent-60, score-0.786]
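
One way to picture refinement: for an unreliable parent schema, compare its success rate when a candidate condition held at activation time against when it did not, and spin off a child schema C′ = C ∪ {condition} for conditions that help. The relevance test and the 0.2 margin below are illustrative stand-ins for the paper's actual refinement statistics.

```python
def refine(parent, activation_log, margin=0.2):
    """activation_log: list of (readings_at_activation, succeeded) pairs for
    the parent schema.  Returns child schemas with one extra context condition."""
    children = []
    candidates = {(s, v) for readings, _ in activation_log
                  for s, v in readings.items() if s not in parent.context}
    for sensor, value in candidates:
        held = [ok for r, ok in activation_log if r.get(sensor) == value]
        absent = [ok for r, ok in activation_log if r.get(sensor) != value]
        if not held or not absent:
            continue
        # Add the condition if it makes success noticeably more likely.
        if sum(held) / len(held) > sum(absent) / len(absent) + margin:
            children.append(Schema({**parent.context, sensor: value},
                                   parent.action, dict(parent.result)))
    return children
```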

32 Synthetic Items: In addition to basic discovery and refinement of schemas, a schema learner also discovers hidden state. [sent-62, score-1.007]

33 Consider the case where no context conditions are found to make a schema reliable. [sent-63, score-0.834]

34 The schema learner therefore creates a new binary-valued virtual sensor, called a synthetic item, to represent the presence of conditions in the environment that allow the schema to succeed. [sent-65, score-1.881]

35 This addresses the state aliasing problem by splitting the state space into two parts, one where the schema succeeds, and one where it does not. [sent-66, score-0.927]

36 Synthetic items are said to reify the host schemas whose success conditions they represent; they have value 1 if the host schema would succeed if activated, and value 0 otherwise. [sent-67, score-1.36]

37 Upon creation, a synthetic item begins to act as a normal sensor, with one exception: the agent has no way of directly perceiving its value. [sent-68, score-0.384]

38 Creation and state maintenance criteria thus emerge as the main problems associated with synthetic items. [sent-69, score-0.369]

39 Drescher originally posited two conditions for the creation of a synthetic item: (1) a schema must be unreliable, and (2) the schema must be locally consistent, meaning that if it succeeds once, it has a high probability of succeeding again if activated soon afterward. [sent-70, score-1.843]

40 Our criterion for creating a synthetic item is 0 < Rel(C −a_i→ R) < θ_r, subject to the constraint that the statistics governing possible additional context conditions have converged. [sent-74, score-0.384]

41 When this criterion is met, a synthetic item is created and is thenceforth treated as a normal sensor, able to be incorporated into the contexts and results of other schemas. [sent-75, score-0.426]
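
A sketch of the creation rule as stated, assuming the Schema sketch above: a schema that sometimes but not reliably succeeds, and whose context statistics have settled, gets a binary virtual sensor tied to it. The θ_r value, the convergence test (a minimum activation count), and the naming scheme are assumptions.

```python
class SyntheticItem:
    def __init__(self, host_schema, name):
        self.host = host_schema   # schema whose success conditions it reifies
        self.name = name          # afterwards behaves like a binary sensor
        self.value = None         # unknown until set retroactively or predicted

def maybe_create_synthetic_item(schema, theta_r=0.9, min_activations=50):
    """Create a synthetic item when 0 < Rel(C -a-> R) < theta_r and the
    context statistics have (approximately) converged."""
    rel = schema.reliability()
    converged = schema.activations >= min_activations
    if converged and 0.0 < rel < theta_r:
        return SyntheticItem(schema, name="syn_" + str(id(schema)))
    return None
```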

42 A newly created synthetic item is grounded: it represents whatever conditions in the world allow the host schema to succeed when activated. [sent-76, score-1.245]

43 Thus, upon activation of the host schema, we retroactively know the state of the synthetic item at the time of activation (1 if the schema succeeded, 0 otherwise). [sent-77, score-1.29]
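
In code, this retroactive step is just a piece of hindsight bookkeeping; the log format below is an assumption for illustration.

```python
def retroactive_update(item, readings_at_activation, host_succeeded, log):
    """After the host schema has been activated, we know in hindsight what the
    synthetic item's value was at activation time: 1 if the host succeeded,
    0 otherwise.  Logging it against that earlier step lets other schemas
    learn which actions and contexts lead to each synthetic-item value."""
    value_then = 1 if host_succeeded else 0
    log.append((readings_at_activation, item.name, value_then))
    return value_then
```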

44 Because the synthetic item is treated as a sensor, we can … (Figure 1: Benchmark problems.) [sent-78, score-0.378]

45 …discover which previous actions led to each synthetic item state, and the synthetic item can come to be included as a result condition in new schemas. [sent-84, score-0.775]

46 Once we have reliable schemas that predict the state of a synthetic item, we can begin to know its state non-retroactively, without having to activate the host schema. [sent-85, score-0.792]

47 The synthetic item’s state can potentially be known just as well as that of the regular sensors, and its addition expands the state representation in just such a way as to make sensory predictions more reliable. [sent-86, score-0.379]

48 Predicted synthetic item state implicitly summarizes the relevant preceding history: it indicates that one of the schemas that predicts it was just activated. [sent-87, score-0.794]

49 If the predicting schema also has a synthetic item in its context, an additional step of history is implied. [sent-88, score-1.168]

50 Such chaining allows synthetic items to summarize arbitrary amounts of history without explicitly remembering any of it. [sent-89, score-0.325]

51 This use of schemas to predict synthetic item state is in contrast to [1], which relied on the average duration of synthetic item states in order to predict them. [sent-90, score-1.219]
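
The contrast between prediction-based and duration-based maintenance can be sketched as follows, building on the earlier sketches; the reliability threshold and the exact duration rule are assumptions rather than the paper's definitions.

```python
def maintain_by_prediction(item, schemas, prev_readings, prev_action,
                           reliability_threshold=0.9):
    """Extended learner: set the item's value when a reliable schema that has
    the item in its result set was just activated; otherwise leave it alone."""
    for schema in schemas:
        if (item.name in schema.result
                and schema.reliability() >= reliability_threshold
                and schema.activated(prev_readings, prev_action)):
            item.value = schema.result[item.name]
            return

def maintain_by_duration(item, steps_since_set, average_duration):
    """Original mechanism (roughly): hold the last known value only for about
    its average observed duration, then fall back to unknown."""
    if steps_since_set > average_duration:
        item.value = None
```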

52 Table 1 compares our extended schema learning criteria with Drescher’s original criteria. [sent-91, score-0.901]

53 Empirical Evaluation: In order to test the advantages of the extended learning criteria, we compared four versions of schema learning. [sent-92, score-0.825]

54 The first two were basic learners that made no use of synthetic items, but discovered and refined schemas using our extended criteria in one case, and the direct generalizations of Drescher’s original criteria in the other. [sent-93, score-0.755]

55 The second pair added the extended and original synthetic item mechanisms, respectively, to the first pair. [sent-94, score-0.47]

56 The flip system is shown on the left in Figure 1; it features deterministic transitions, hidden state, and a null action that confounds simplistic history approaches to handling hidden state. [sent-97, score-0.273]

57 Finally, we use a modified float/reset system in which the f action from the two right-most states leads deterministically to their left neighbor; this reveals more about the hidden state structure. [sent-99, score-0.246]

58 To test predictive power, each schema learner, upon taking an action, uses the most reliable of all activated schemas to predict what the next value of each sensor will be. [sent-100, score-1.335]

59 If there is no activation of a reliable schema to predict the value of a particular sensor, its value is predicted to stay constant. [sent-101, score-0.839]
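
The next-step prediction procedure just described, together with the weather-predictor baseline used later, can be sketched as follows (reusing the Schema sketch above; the reliability-threshold parameter is an assumption):

```python
def predict_next(readings, action, schemas, reliability_threshold=0.0):
    """For each sensor, use the most reliable activated schema that predicts
    it; with no such schema, predict that the sensor keeps its current value."""
    prediction = dict(readings)           # default: no change
    for sensor in readings:
        best = None
        for schema in schemas:
            if sensor in schema.result and schema.activated(readings, action):
                if best is None or schema.reliability() > best.reliability():
                    best = schema
        if best is not None and best.reliability() > reliability_threshold:
            prediction[sensor] = best.result[sensor]
    return prediction

def weather_predictor(readings, action, schemas=None):
    """Baseline: every sensor value stays as it is presently."""
    return dict(readings)
```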

60 No learning parameters are changed over time; schemas stop being created when discovery and refinement criteria cease to generate them. [sent-104, score-0.466]

61–63 [Figure 2 residue: prediction error vs. timestep (0–10,000) for the flip, float/reset, modified float/reset, and speech modeling tasks, with curves for the extended and original learners and their baselines, plus the weather predictor and 2-context/3-context schema learners on the speech task.]

64 In the speech modeling graph, learning is stopped after approximately 4300 timesteps (shown by the vertical line), after which no schemas are added, though reliabilities continue to be updated. [sent-137, score-0.451]

65 Figure 2 displays results for this learner compared to a baseline weather predictor. [sent-160, score-0.225]

66 Large divergence in the curves corresponds to the creation of synthetic items and the discovery of schemas that predict synthetic item state. [sent-163, score-1.124]

67 In flip and modified float/reset, the extended schema learner reaches zero error, having a complete model of the hidden state, and outperforms all other learners, while the extended basic version outperforms both original learners. [sent-165, score-1.112]

68 In float/reset, all learners perform approximately equally, reflecting the fact that, given the hidden stochasticity of this system, the best schema for action r is one that, without reference to synthetic items, gives a prediction of 1. [sent-166, score-1.173]

69 Surprisingly, the original learner never significantly outperformed its baseline, and even performed worse than the baseline in flip. [sent-167, score-0.225]

70 This is accounted for by the duration-based maintenance of synthetic items, which causes the original learner to maintain transient synthetic item state longer than it should. [sent-168, score-0.891]

71 The speech modeling results show that schema learning can induce high-quality action models in a complex, noisy domain. [sent-170, score-0.922]

72 6% in the training and testing phases, respectively, demonstrating the importance of incremental specialization of schemas through context refinement. [sent-179, score-0.423]

73 Altogether, these results show that our extended schema learner produces better action models than the original, and can handle more complex domains. [sent-180, score-1.092]

74 Synthetic items are seen to effectively model hidden state, and prediction-based maintenance of synthetic item state is shown to be more accurate than duration-based maintenance in POMDPs. [sent-181, score-0.705]

75 Discovery of schemas is improved by our criterion, missing fewer legitimate schemas, and therefore producing more accurate predictions. [sent-182, score-0.356]

76 Comparison to Predictive State Representations: Predictive state representations (PSRs; [2]), like schema learning, are based on grounded, sensorimotor predictions that uncover hidden state. [sent-184, score-0.968]

77 In a PSR, the environment state is represented as the probabilities that each of a set of core tests would yield its observations if its actions were executed. [sent-190, score-0.229]

78 A schema is similar to a one-step PSR test, and schema reliability roughly corresponds to the probability of a PSR test. [sent-194, score-1.559]

79 Schemas differ, however, in that they only specify context and result incrementally, incorporating incremental history via synthetic items, while PSR tests incorporate the complete history and full observations (i.e., all sensor readings at once). [sent-195, score-0.404]

80 A weather predictor always predicts that values will stay the same as they are presently. [sent-197, score-0.15]

81 [Table 3: Prediction error for PSRs and schema learning on several POMDPs; schema learning used 10,000 steps on three of the problems and 30,000 on the fourth.] [sent-207, score-0.774]

82 Thus, PSRs are more useful as Markovian state for reinforcement learning, while schemas are useful for explicit planning. [sent-212, score-0.433]

83 Note that synthetic items and PSR core test probabilities both attempt to capture a sufficient history statistic without explicitly maintaining history. [sent-213, score-0.356]

84 We compared the predictive performance of PSRs with that of schema learning on some of the POMDPs from [5]. [sent-215, score-0.808]

85 One-step PSR core tests can be used to predict observations: as an action is taken, the probability of each observation is the probability of the one-step core test that uses the current action and terminates in that observation. [sent-216, score-0.339]

86 This allows us to evaluate PSR predictions using the same error measure (fraction of incorrect predictions) as in schema learning. [sent-218, score-0.811]
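
The shared error measure is simple to state in code; whether the fraction is tallied per sensor per step, as below, is an assumption about the bookkeeping.

```python
def prediction_error(trajectory, predictor):
    """Fraction of incorrect per-sensor next-step predictions.
    trajectory: list of (readings_t, action_t, readings_t_plus_1) triples;
    predictor: function (readings, action) -> predicted next readings."""
    wrong = total = 0
    for readings, action, nxt in trajectory:
        predicted = predictor(readings, action)
        for sensor, actual in nxt.items():
            total += 1
            wrong += int(predicted.get(sensor) != actual)
    return wrong / total if total else 0.0
```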

87 In our experiments, the extended schema learner was first allowed to learn until it reached an asymptotic minimum error (no longer than 30,000 steps). [sent-219, score-0.972]

88 Learning was then deactivated, and the schema learner and PSR each made predictions over a series of randomly chosen actions. [sent-220, score-0.918]

89 Learning PSR parameters required 1-10 million timesteps [5], while schema learning used no more than 30,000 steps. [sent-222, score-0.798]

90 Also, learning PSR parameters required access to the underlying POMDP [5], whereas schema learning relies solely on sensorimotor information. [sent-223, score-0.778]

91 Related Work: Aside from PSRs, schema learning is also similar to older work in learning planning operators, most notably that of Wang [7], Gil [8], and Shen [9]. [sent-224, score-0.796]

92 Unlike schema learning, they make the strong assumption that the environment does not produce noisy observations. [sent-226, score-0.795]

93 Benson [10] gives his learner prior knowledge about action effects, and the learner finds conditions to make the effects reliable with some tolerance for noise. [sent-229, score-0.456]

94 [11] use regression trees to find regions of noisy, continuous sensor space that cause a specified action to vary in the degree of its effect. [sent-232, score-0.209]

95 This bears a strong resemblance to the history represented by chains of synthetic items, a connection that should be explored more fully. [sent-239, score-0.259]

96 Discussion and Future Work: We have shown that our extended schema learner produces accurate action models for a variety of POMDP systems and for a complex speech modeling task. [sent-242, score-1.104]

97 The extended schema learner performs substantially better than the original, and compares favorably in predictive power to PSRs while appearing to learn much faster. [sent-243, score-1.006]

98 Building probabilistic goal-regression planning on top of the schemas is a logical next step; however, to succeed with real-world planning problems, we believe that we need to extend the learning mechanism in several ways. [sent-244, score-0.496]

99 For example, the schema learner must explicitly handle actions whose effects occur over an extended duration instead of after one timestep. [sent-245, score-1.078]

100 The learner should also be able to directly handle continuous-valued sensors. [sent-246, score-0.149]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('schema', 0.754), ('schemas', 0.356), ('synthetic', 0.188), ('item', 0.173), ('psr', 0.145), ('learner', 0.127), ('psrs', 0.119), ('action', 0.118), ('drescher', 0.095), ('rel', 0.094), ('sensor', 0.091), ('items', 0.084), ('state', 0.077), ('nement', 0.072), ('extended', 0.071), ('sj', 0.068), ('maintenance', 0.066), ('pomdp', 0.063), ('timestep', 0.062), ('baseline', 0.06), ('creation', 0.057), ('discovery', 0.056), ('predictive', 0.054), ('actions', 0.053), ('history', 0.053), ('pomdps', 0.053), ('host', 0.052), ('hidden', 0.051), ('reliability', 0.051), ('sensors', 0.048), ('context', 0.047), ('ip', 0.045), ('timesteps', 0.044), ('planning', 0.042), ('prob', 0.039), ('criteria', 0.038), ('original', 0.038), ('aaai', 0.038), ('weather', 0.038), ('activated', 0.038), ('predictions', 0.037), ('prediction', 0.036), ('constructivist', 0.036), ('shen', 0.036), ('utree', 0.036), ('speech', 0.034), ('conditions', 0.033), ('criterion', 0.032), ('od', 0.031), ('core', 0.031), ('effects', 0.031), ('ai', 0.029), ('succeed', 0.029), ('mechanism', 0.027), ('re', 0.027), ('sr', 0.026), ('learners', 0.026), ('representations', 0.025), ('grounded', 0.025), ('environment', 0.025), ('atlanta', 0.024), ('balac', 0.024), ('benson', 0.024), ('georgia', 0.024), ('strips', 0.024), ('sensorimotor', 0.024), ('observations', 0.024), ('operators', 0.023), ('agent', 0.023), ('activation', 0.023), ('met', 0.023), ('predict', 0.022), ('handle', 0.022), ('applicable', 0.021), ('mechanisms', 0.021), ('readings', 0.021), ('vowel', 0.021), ('isbell', 0.021), ('gil', 0.021), ('transitions', 0.02), ('error', 0.02), ('predicted', 0.02), ('duration', 0.02), ('incremental', 0.02), ('reliable', 0.02), ('discovers', 0.019), ('aliasing', 0.019), ('succeeds', 0.019), ('tests', 0.019), ('xp', 0.018), ('annealed', 0.018), ('bears', 0.018), ('maintain', 0.017), ('treated', 0.017), ('benchmark', 0.017), ('stopped', 0.017), ('transient', 0.017), ('created', 0.016), ('noisy', 0.016), ('japanese', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

Author: Michael P. Holmes, Charles L. Isbell Jr.

Abstract: Schema learning is a way to discover probabilistic, constructivist, predictive action models (schemas) from experience. It includes methods for finding and using hidden state to make predictions more accurate. We extend the original schema mechanism [1] to handle arbitrary discrete-valued sensors, improve the original learning criteria to handle POMDP domains, and better maintain hidden state by using schema predictions. These extensions show large improvement over the original schema mechanism in several rewardless POMDPs, and achieve very low prediction error in a difficult speech modeling task. Further, we compare extended schema learning to the recently introduced predictive state representations [2], and find their predictions of next-step action effects to be approximately equal in accuracy. This work lays the foundation for a schema-based system of integrated learning and planning. 1

2 0.075056009 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors

Author: Guy Shani, Ronen I. Brafman

Abstract: Agents learning to act in a partially observable domain may need to overcome the problem of perceptual aliasing – i.e., different states that appear similar but require different responses. This problem is exacerbated when the agent’s sensors are noisy, i.e., sensors may produce different observations in the same state. We show that many well-known reinforcement learning methods designed to deal with perceptual aliasing, such as Utile Suffix Memory, finite size history windows, eligibility traces, and memory bits, do not handle noisy sensors well. We suggest a new algorithm, Noisy Utile Suffix Memory (NUSM), based on USM, that uses a weighted classification of observed trajectories. We compare NUSM to the above methods and show it to be more robust to noise.

3 0.067412391 64 nips-2004-Experts in a Markov Decision Process

Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour

Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1

4 0.061731506 183 nips-2004-Temporal-Difference Networks

Author: Richard S. Sutton, Brian Tanner

Abstract: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the interpredictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. The essential idea of TD learning can be described as “learning a guess from a guess”. In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the first. To be more precise, we must make a distinction between the extensive definition of a prediction, expressing its desired relationship to measurable data, and its TD definition, expressing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future rewards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It’s the same prediction, just defined or expressed in different ways. In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics. 
1 The Learning-to-predict Problem The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation o t ∈ O, and the agent takes an action at ∈ A. Whereas A is an arbitrary discrete set, we assume without loss of generality that ot can be represented as a vector of bits. The action and observation events occur in sequence, o1 , a1 , o2 , a2 , o3 · · ·, with each event of course dependent only on those preceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section. In this paper we use a random-walk problem with seven states, with left and right actions available in every state: 1 1 0 2 0 3 0 4 0 5 0 6 1 7 The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the first two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends deterministically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain. The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some numbers of discrete steps in advance, first unconditionally and then conditioned on action sequences. 2 TD Networks A TD network is a network of nodes, each representing a single scalar prediction. The nodes are interconnected by links representing the TD relationships among the predictions and to the observations and actions. These links determine the extensive semantics of each prediction—its desired or target relationship to the data. They represent what we seek to predict about the data as opposed to how we try to predict it. We think of these links as determining a set of questions being asked about the data, and accordingly we call them the question network. A separate set of interconnections determines the actual computational process—the updating of the predictions at each node from their previous values and the current action and observation. We think of this process as providing the answers to the questions, and accordingly we call them the answer network. The question network provides targets for a learning process shaping the answer network and does not otherwise affect the behavior of the TD network. It is natural to consider changing the question network, but in this paper we take it as fixed and given. Figure 1a shows a suggestive example of a question network. The three squares across the top represent three observation bits. The node labeled 1 is directly connected to the first observation bit and represents a prediction that that bit will be 1 on the next time step. The node labeled 2 is similarly a prediction of the expected value of node 1 on the next step. 
Thus the extensive definition of Node 2’s prediction is the probability that the first observation bit will be 1 two time steps from now. Node 3 similarly predicts the first observation bit three time steps in the future. Node 4 is a conventional TD prediction, in this case of the future discounted sum of the second observation bit, with discount parameter γ. Its target is the familiar TD target, the data bit plus the node’s own prediction on the next time step (with weightings 1 − γ and γ respectively). Nodes 5 and 6 predict the probability of the third observation bit being 1 if particular actions a or b are taken respectively. Node 7 is a prediction of the average of the first observation bit and Node 4’s prediction, both on the next step. This is the first case where it is not easy to see or state the extensive semantics of the prediction in terms of the data. Node 8 predicts another average, this time of nodes 4 and 5, and the question it asks is even harder to express extensively. One could continue in this way, adding more and more nodes whose extensive definitions are difficult to express but which would nevertheless be completely defined as long as these local TD relationships are clear. The thinner links shown entering some nodes are meant to be a suggestion of the entirely separate answer network determining the actual computation (as opposed to the goals) of the network. In this paper we consider only simple question networks such as the left column of Figure 1a and of the action-conditional tree form shown in Figure 1b. 1−γ 1 4 γ a 5 b L 6 L 2 7 R L R R 8 3 (a) (b) Figure 1: The question networks of two TD networks. (a) a question network discussed in the text, and (b) a depth-2 fully-action-conditional question network used in Experiments 2 and 3. Observation bits are represented as squares across the top while actual nodes of the TD network, corresponding each to a separate prediction, are below. The thick lines represent the question network and the thin lines in (a) suggest the answer network (the bulk of which is not shown). Note that all of these nodes, arrows, and numbers are completely different and separate from those representing the random-walk problem on the preceding page. i More formally and generally, let yt ∈ [0, 1], i = 1, . . . , n, denote the prediction of the 1 n ith node at time step t. The column vector of predictions yt = (yt , . . . , yt )T is updated according to a vector-valued function u with modifiable parameter W: yt = u(yt−1 , at−1 , ot , Wt ) ∈ n . (1) The update function u corresponds to the answer network, with W being the weights on its links. Before detailing that process, we turn to the question network, the defining TD i i relationships between nodes. The TD target zt for yt is an arbitrary function z i of the successive predictions and observations. In vector form we have 1 zt = z(ot+1 , ˜t+1 ) ∈ n , y (2) where ˜t+1 is just like yt+1 , as in (1), except calculated with the old weights before they y are updated on the basis of zt : ˜t = u(yt−1 , at−1 , ot , Wt−1 ) ∈ n . y (3) (This temporal subtlety also arises in conventional TD learning.) For example, for the 1 2 1 3 2 4 4 nodes in Figure 1a we have zt = o1 , zt = yt+1 , zt = yt+1 , zt = (1 − γ)o2 + γyt+1 , t+1 t+1 1 1 1 4 1 4 1 5 5 6 3 7 8 zt = zt = ot+1 , zt = 2 ot+1 + 2 yt+1 , and zt = 2 yt+1 + 2 yt+1 . The target functions z i are only part of specifying the question network. The other part has to do with making them potentially conditional on action and observation. 
For example, Node 5 in Figure 1a predicts what the third observation bit will be if action a is taken. To arrange for such i semantics we introduce a new vector ct of conditions, ci , indicating the extent to which yt t i is held responsible for matching zt , thus making the ith prediction conditional on ci . Each t ci is determined as an arbitrary function ci of at and yt . In vector form we have: t ct = c(at , yt ) ∈ [0, 1]n . (4) For example, for Node 5 in Figure 1a, c5 = 1 if at = a, otherwise c5 = 0. t t Equations (2–4) correspond to the question network. Let us now turn to defining u, the update function for yt mentioned earlier and which corresponds to the answer network. In general u is an arbitrary function approximator, but for concreteness we define it to be of a linear form yt = σ(Wt xt ) (5) m where xt ∈ is a feature vector, Wt is an n × m matrix, and σ is the n-vector form of the identity function (Experiments 1 and 2) or the S-shaped logistic function σ(s) = 1 1+e−s (Experiment 3). The feature vector is an arbitrary function of the preceding action, observation, and node values: xt = x(at−1 , ot , yt−1 ) ∈ m . (6) For example, xt might have one component for each observation bit, one for each possible action (one of which is 1, the rest 0), and n more for the previous node values y t−1 . The ij learning algorithm for each component wt of Wt is ij ij i i wt+1 − wt = α(zt − yt )ci t i ∂yt , (7) ij ∂wt where α is a step-size parameter. The timing details may be clarified by writing the sequence of quantities in the order in which they are computed: yt at ct ot+1 xt+1 ˜t+1 zt Wt+1 yt+1 . y (8) Finally, the target in the extensive sense for yt is (9) y∗ = Et,π (1 − ct ) · y∗ + ct · z(ot+1 , y∗ ) , t t+1 t where · represents component-wise multiplication and π is the policy being followed, which is assumed fixed. 1 In general, z is a function of all the future predictions and observations, but in this paper we treat only the one-step case. 3 Experiment 1: n-step Unconditional Prediction In this experiment we sought to predict the observation bit precisely n steps in advance, for n = 1, 2, 5, 10, and 25. In order to predict n steps in advance, of course, we also have to predict n − 1 steps in advance, n − 2 steps in advance, etc., all the way down to predicting one step ahead. This is specified by a TD network consisting of a single chain of predictions like the left column of Figure 1a, but of length 25 rather than 3. Random-walk sequences were constructed by starting at the center state and then taking random actions for 50, 100, 150, and 200 steps (100 sequences each). We applied a TD network and a corresponding Monte Carlo method to this data. The Monte Carlo method learned the same predictions, but learned them by comparing them to the i actual outcomes in the sequence (instead of zt in (7)). This involved significant additional complexity to store the predictions until their corresponding targets were available. Both algorithms used feature vectors of 7 binary components, one for each of the seven states, all of which were zero except for the one corresponding to the current state. Both algorithms formed their predictions linearly (σ(·) was the identity) and unconditionally (c i = 1 ∀i, t). t In an initial set of experiments, both algorithms were applied online with a variety of values for their step-size parameter α. Under these conditions we did not find that either algorithm was clearly better in terms of the mean square error in their predictions over the data sets. 
We found a clearer result when both algorithms were trained using batch updating, in which weight changes are collected “on the side” over an experience sequence and then made all at once at the end, and the whole process is repeated until convergence. Under batch updating, convergence is to the same predictions regardless of initial conditions or α value (as long as α is sufficiently small), which greatly simplifies comparison of algorithms. The predictions learned under batch updating are also the same as would be computed by least squares algorithms such as LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2000; Lagoudakis & Parr, 2003). The errors in the final predictions are shown in Table 1. For 1-step predictions, the Monte-Carlo and TD methods performed identically of course, but for longer predictions a significant difference was observed. The RMSE of the Monte Carlo method increased with prediction length whereas for the TD network it decreased. The largest standard error in any of the numbers shown in the table is 0.008, so almost all of the differences are statistically significant. TD methods appear to have a significant data-efficiency advantage over non-TD methods in this prediction-by-n context (and this task) just as they do in conventional multi-step prediction (Sutton, 1988). Time Steps 50 100 150 200 1-step MC/TD 0.205 0.124 0.089 0.076 2-step MC TD 0.219 0.172 0.133 0.100 0.103 0.073 0.084 0.060 5-step MC TD 0.234 0.159 0.160 0.098 0.121 0.076 0.109 0.065 10-step MC TD 0.249 0.139 0.168 0.079 0.130 0.063 0.112 0.056 25-step MC TD 0.297 0.129 0.187 0.068 0.153 0.054 0.118 0.049 Table 1: RMSE of Monte-Carlo and TD-network predictions of various lengths and for increasing amounts of training data on the random-walk example with batch updating. 4 Experiment 2: Action-conditional Prediction The advantage of TD methods should be greater for predictions that apply only when the experience sequence unfolds in a particular way, such as when a particular sequence of actions are made. In a second experiment we sought to learn n-step-ahead predictions conditional on action selections. The question network for learning all 2-step-ahead pre- dictions is shown in Figure 1b. The upper two nodes predict the observation bit conditional on taking a left action (L) or a right action (R). The lower four nodes correspond to the two-step predictions, e.g., the second lower node is the prediction of what the observation bit will be if an L action is taken followed by an R action. These predictions are the same as the e-tests used in some of the work on predictive state representations (Littman, Sutton & Singh, 2002; Rudary & Singh, 2003). In this experiment we used a question network like that in Figure 1b except of depth four, consisting of 30 (2+4+8+16) nodes. The conditions for each node were set to 0 or 1 depending on whether the action taken on the step matched that indicated in the figure. The feature vectors were as in the previous experiment. Now that we are conditioning on action, the problem is deterministic and α can be set uniformly to 1. A Monte Carlo prediction can be learned only when its corresponding action sequence occurs in its entirety, but then it is complete and accurate in one step. The TD network, on the other hand, can learn from incomplete sequences but must propagate them back one level at a time. First the one-step predictions must be learned, then the two-step predictions from them, and so on. The results for online and batch training are shown in Tables 2 and 3. 
As anticipated, the TD network learns much faster than Monte Carlo with both online and batch updating. Because the TD network learns its n step predictions based on its n − 1 step predictions, it has a clear advantage for this task. Once the TD Network has seen each action in each state, it can quickly learn any prediction 2, 10, or 1000 steps in the future. Monte Carlo, on the other hand, must sample actual sequences, so each exact action sequence must be observed. Time Step 100 200 300 400 500 1-Step MC/TD 0.153 0.019 0.000 0.000 0.000 2-Step MC TD 0.222 0.182 0.092 0.044 0.040 0.000 0.019 0.000 0.019 0.000 3-Step MC TD 0.253 0.195 0.142 0.054 0.089 0.013 0.055 0.000 0.038 0.000 4-Step MC TD 0.285 0.185 0.196 0.062 0.139 0.017 0.093 0.000 0.062 0.000 Table 2: RMSE of the action-conditional predictions of various lengths for Monte-Carlo and TD-network methods on the random-walk problem with online updating. Time Steps 50 100 150 200 MC 53.48% 30.81% 19.26% 11.69% TD 17.21% 4.50% 1.57% 0.14% Table 3: Average proportion of incorrect action-conditional predictions for batch-updating versions of Monte-Carlo and TD-network methods, for various amounts of data, on the random-walk task. All differences are statistically significant. 5 Experiment 3: Learning a Predictive State Representation Experiments 1 and 2 showed advantages for TD learning methods in Markov problems. The feature vectors in both experiments provided complete information about the nominal state of the random walk. In Experiment 3, on the other hand, we applied TD networks to a non-Markov version of the random-walk example, in particular, in which only the special observation bit was visible and not the state number. In this case it is not possible to make accurate predictions based solely on the current action and observation; the previous time step’s predictions must be used as well. As in the previous experiment, we sought to learn n-step predictions using actionconditional question networks of depths 2, 3, and 4. The feature vector xt consisted of three parts: a constant 1, four binary features to represent the pair of action a t−1 and observation bit ot , and n more features corresponding to the components of y t−1 . The features vectors were thus of length m = 11, 19, and 35 for the three depths. In this experiment, σ(·) was the S-shaped logistic function. The initial weights W0 and predictions y0 were both 0. Fifty random-walk sequences were constructed, each of 250,000 time steps, and presented to TD networks of the three depths, with a range of step-size parameters α. We measured the RMSE of all predictions made by the networks (computed from knowledge of the task) and also the “empirical RMSE,” the error in the one-step prediction for the action actually taken on each step. We found that in all cases the errors approached zero over time, showing that the problem was completely solved. Figure 2 shows some representative learning curves for the depth-2 and depth-4 TD networks. .3 Empirical RMS error .2 α=.1 .1 α=.5 α=.5 α=.75 0 0 α=.25 depth 2 50K 100K 150K 200K 250K Time Steps Figure 2: Prediction performance on the non-Markov random walk with depth-4 TD networks (and one depth-2 network) with various step-size parameters, averaged over 50 runs and 1000 time-step bins. The “bump” most clearly seen with small step sizes is reliably present and may be due to predictions of different lengths being learned at different times. 
In ongoing experiments on other non-Markov problems we have found that TD networks do not always find such complete solutions. Other problems seem to require more than one step of history information (the one-step-preceding action and observation), though less than would be required using history information alone. Our results as a whole suggest that TD networks may provide an effective alternative learning algorithm for predictive state representations (Littman et al., 2000). Previous algorithms have been found to be effective on some tasks but not on others (e.g, Singh et al., 2003; Rudary & Singh, 2004; James & Singh, 2004). More work is needed to assess the range of effectiveness and learning rate of TD methods vis-a-vis previous methods, and to explore their combination with history information. 6 Conclusion TD networks suggest a large set of possibilities for learning to predict, and in this paper we have begun exploring the first few. Our results show that even in a fully observable setting there may be significant advantages to TD methods when learning TD-defined predictions. Our action-conditional results show that TD methods can learn dramatically faster than other methods. TD networks allow the expression of many new kinds of predictions whose extensive semantics is not immediately clear, but which are ultimately fully grounded in data. It may be fruitful to further explore the expressive potential of TD-defined predictions. Although most of our experiments have concerned the representational expressiveness and efficiency of TD-defined predictions, it is also natural to consider using them as state, as in predictive state representations. Our experiments suggest that this is a promising direction and that TD learning algorithms may have advantages over previous learning methods. Finally, we note that adding nodes to a question network produces new predictions and thus may be a way to address the discovery problem for predictive representations. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Satinder Singh, Doina Precup, Michael Littman, Mark Ring, Vadim Bulitko, Eddie Rafols, Anna Koop, Tao Wang, and all the members of the rlai.net group. References Boyan, J. A. (2000). Technical update: Least-squares temporal difference learning. Machine Learning 49:233–246. Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1/2/3):33–57. Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation 5(4):613–624. James, M. and Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 417–424. Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pp. 167–173. Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research 4(Dec):1107–1149. Littman, M. L., Sutton, R. S. and Singh, S. (2002). Predictive representations of state. In Advances In Neural Information Processing Systems 14:1555–1561. Rudary, M. R. and Singh, S. (2004). A nonlinear predictive state representation. In Advances in Neural Information Processing Systems 16:855–862. Singh, S., Littman, M. L., Jong, N. K., Pardoe, D. and Stone, P. 
(2003) Learning predictive state representations. In Proceedings of the Twentieth Int. Conference on Machine Learning, pp. 712–719. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9–44. Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 531–539. Morgan Kaufmann, San Francisco. Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181–121.

5 0.060325213 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs

Author: Pascal Poupart, Craig Boutilier

Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

6 0.045301095 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity

7 0.043447338 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

8 0.040967397 24 nips-2004-Approximately Efficient Online Mechanism Design

9 0.040062886 88 nips-2004-Intrinsically Motivated Reinforcement Learning

10 0.039419409 136 nips-2004-On Semi-Supervised Classification

11 0.037357654 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters

12 0.035698265 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces

13 0.033693939 33 nips-2004-Brain Inspired Reinforcement Learning

14 0.03353082 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

15 0.032799546 39 nips-2004-Coarticulation in Markov Decision Processes

16 0.030618707 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

17 0.030141028 48 nips-2004-Convergence and No-Regret in Multiagent Learning

18 0.029618546 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

19 0.028846012 124 nips-2004-Multiple Alignment of Continuous Time Series

20 0.028549543 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.093), (1, -0.015), (2, 0.095), (3, -0.047), (4, -0.058), (5, 0.057), (6, 0.012), (7, 0.023), (8, -0.032), (9, -0.006), (10, -0.002), (11, 0.014), (12, -0.013), (13, 0.028), (14, 0.01), (15, -0.009), (16, -0.006), (17, -0.027), (18, 0.019), (19, -0.058), (20, -0.023), (21, 0.078), (22, -0.04), (23, 0.034), (24, -0.017), (25, 0.038), (26, 0.003), (27, 0.051), (28, 0.093), (29, 0.028), (30, 0.015), (31, -0.003), (32, -0.007), (33, 0.03), (34, -0.028), (35, -0.022), (36, 0.012), (37, -0.072), (38, -0.041), (39, -0.038), (40, 0.02), (41, -0.045), (42, 0.016), (43, -0.004), (44, -0.085), (45, 0.152), (46, -0.005), (47, -0.066), (48, 0.004), (49, -0.138)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94575524 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

Author: Michael P. Holmes, Charles L. Isbell Jr.

Abstract: Schema learning is a way to discover probabilistic, constructivist, predictive action models (schemas) from experience. It includes methods for finding and using hidden state to make predictions more accurate. We extend the original schema mechanism [1] to handle arbitrary discrete-valued sensors, improve the original learning criteria to handle POMDP domains, and better maintain hidden state by using schema predictions. These extensions show large improvement over the original schema mechanism in several rewardless POMDPs, and achieve very low prediction error in a difficult speech modeling task. Further, we compare extended schema learning to the recently introduced predictive state representations [2], and find their predictions of next-step action effects to be approximately equal in accuracy. This work lays the foundation for a schema-based system of integrated learning and planning. 1

2 0.6874631 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors

Author: Guy Shani, Ronen I. Brafman

Abstract: Agents learning to act in a partially observable domain may need to overcome the problem of perceptual aliasing – i.e., different states that appear similar but require different responses. This problem is exacerbated when the agent’s sensors are noisy, i.e., sensors may produce different observations in the same state. We show that many well-known reinforcement learning methods designed to deal with perceptual aliasing, such as Utile Suffix Memory, finite size history windows, eligibility traces, and memory bits, do not handle noisy sensors well. We suggest a new algorithm, Noisy Utile Suffix Memory (NUSM), based on USM, that uses a weighted classification of observed trajectories. We compare NUSM to the above methods and show it to be more robust to noise.

3 0.6705789 183 nips-2004-Temporal-Difference Networks

Author: Richard S. Sutton, Brian Tanner

Abstract: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the interpredictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. The essential idea of TD learning can be described as “learning a guess from a guess”. In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the first. To be more precise, we must make a distinction between the extensive definition of a prediction, expressing its desired relationship to measurable data, and its TD definition, expressing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future rewards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It’s the same prediction, just defined or expressed in different ways. In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics. 
1 The Learning-to-predict Problem The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation o t ∈ O, and the agent takes an action at ∈ A. Whereas A is an arbitrary discrete set, we assume without loss of generality that ot can be represented as a vector of bits. The action and observation events occur in sequence, o1 , a1 , o2 , a2 , o3 · · ·, with each event of course dependent only on those preceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section. In this paper we use a random-walk problem with seven states, with left and right actions available in every state: 1 1 0 2 0 3 0 4 0 5 0 6 1 7 The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the first two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends deterministically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain. The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some numbers of discrete steps in advance, first unconditionally and then conditioned on action sequences. 2 TD Networks A TD network is a network of nodes, each representing a single scalar prediction. The nodes are interconnected by links representing the TD relationships among the predictions and to the observations and actions. These links determine the extensive semantics of each prediction—its desired or target relationship to the data. They represent what we seek to predict about the data as opposed to how we try to predict it. We think of these links as determining a set of questions being asked about the data, and accordingly we call them the question network. A separate set of interconnections determines the actual computational process—the updating of the predictions at each node from their previous values and the current action and observation. We think of this process as providing the answers to the questions, and accordingly we call them the answer network. The question network provides targets for a learning process shaping the answer network and does not otherwise affect the behavior of the TD network. It is natural to consider changing the question network, but in this paper we take it as fixed and given. Figure 1a shows a suggestive example of a question network. The three squares across the top represent three observation bits. The node labeled 1 is directly connected to the first observation bit and represents a prediction that that bit will be 1 on the next time step. The node labeled 2 is similarly a prediction of the expected value of node 1 on the next step. 
Thus the extensive definition of Node 2’s prediction is the probability that the first observation bit will be 1 two time steps from now. Node 3 similarly predicts the first observation bit three time steps in the future. Node 4 is a conventional TD prediction, in this case of the future discounted sum of the second observation bit, with discount parameter γ. Its target is the familiar TD target, the data bit plus the node’s own prediction on the next time step (with weightings 1 − γ and γ respectively). Nodes 5 and 6 predict the probability of the third observation bit being 1 if particular actions a or b are taken respectively. Node 7 is a prediction of the average of the first observation bit and Node 4’s prediction, both on the next step. This is the first case where it is not easy to see or state the extensive semantics of the prediction in terms of the data. Node 8 predicts another average, this time of nodes 4 and 5, and the question it asks is even harder to express extensively. One could continue in this way, adding more and more nodes whose extensive definitions are difficult to express but which would nevertheless be completely defined as long as these local TD relationships are clear. The thinner links shown entering some nodes are meant to be a suggestion of the entirely separate answer network determining the actual computation (as opposed to the goals) of the network. In this paper we consider only simple question networks such as the left column of Figure 1a and of the action-conditional tree form shown in Figure 1b.

[Figure 1 (diagram omitted): The question networks of two TD networks. (a) a question network discussed in the text, and (b) a depth-2 fully-action-conditional question network used in Experiments 2 and 3. Observation bits are represented as squares across the top while actual nodes of the TD network, corresponding each to a separate prediction, are below. The thick lines represent the question network and the thin lines in (a) suggest the answer network (the bulk of which is not shown). Note that all of these nodes, arrows, and numbers are completely different and separate from those representing the random-walk problem on the preceding page.]

More formally and generally, let y_t^i ∈ [0, 1], i = 1, . . . , n, denote the prediction of the ith node at time step t. The column vector of predictions y_t = (y_t^1, . . . , y_t^n)^T is updated according to a vector-valued function u with modifiable parameter W:
$$y_t = u(y_{t-1}, a_{t-1}, o_t, W_t) \in \mathbb{R}^n. \quad (1)$$
The update function u corresponds to the answer network, with W being the weights on its links. Before detailing that process, we turn to the question network, the defining TD relationships between nodes. The TD target z_t^i for y_t^i is an arbitrary function z^i of the successive predictions and observations. In vector form we have¹
$$z_t = z(o_{t+1}, \tilde{y}_{t+1}) \in \mathbb{R}^n, \quad (2)$$
where ỹ_{t+1} is just like y_{t+1}, as in (1), except calculated with the old weights before they are updated on the basis of z_t:
$$\tilde{y}_t = u(y_{t-1}, a_{t-1}, o_t, W_{t-1}) \in \mathbb{R}^n. \quad (3)$$
(This temporal subtlety also arises in conventional TD learning.) For example, for the nodes in Figure 1a we have z_t^1 = o^1_{t+1}, z_t^2 = y^1_{t+1}, z_t^3 = y^2_{t+1}, z_t^4 = (1 − γ) o^2_{t+1} + γ y^4_{t+1}, z_t^5 = z_t^6 = o^3_{t+1}, z_t^7 = (1/2) o^1_{t+1} + (1/2) y^4_{t+1}, and z_t^8 = (1/2) y^4_{t+1} + (1/2) y^5_{t+1}. The target functions z^i are only part of specifying the question network. The other part has to do with making them potentially conditional on action and observation.
For example, Node 5 in Figure 1a predicts what the third observation bit will be if action a is taken. To arrange for such semantics we introduce a new vector c_t of conditions, c_t^i, indicating the extent to which y_t^i is held responsible for matching z_t^i, thus making the ith prediction conditional on c_t^i. Each c_t^i is determined as an arbitrary function c^i of a_t and y_t. In vector form we have:
$$c_t = c(a_t, y_t) \in [0, 1]^n. \quad (4)$$
For example, for Node 5 in Figure 1a, c_t^5 = 1 if a_t = a, otherwise c_t^5 = 0. Equations (2–4) correspond to the question network. Let us now turn to defining u, the update function for y_t mentioned earlier and which corresponds to the answer network. In general u is an arbitrary function approximator, but for concreteness we define it to be of a linear form
$$y_t = \sigma(W_t x_t), \quad (5)$$
where x_t ∈ ℝ^m is a feature vector, W_t is an n × m matrix, and σ is the n-vector form of the identity function (Experiments 1 and 2) or the S-shaped logistic function σ(s) = 1/(1 + e^{−s}) (Experiment 3). The feature vector is an arbitrary function of the preceding action, observation, and node values:
$$x_t = x(a_{t-1}, o_t, y_{t-1}) \in \mathbb{R}^m. \quad (6)$$
For example, x_t might have one component for each observation bit, one for each possible action (one of which is 1, the rest 0), and n more for the previous node values y_{t−1}. The learning algorithm for each component w_t^{ij} of W_t is
$$w_{t+1}^{ij} - w_t^{ij} = \alpha\,(z_t^i - y_t^i)\,c_t^i\,\frac{\partial y_t^i}{\partial w_t^{ij}}, \quad (7)$$
where α is a step-size parameter. The timing details may be clarified by writing the sequence of quantities in the order in which they are computed:
$$y_t \;\; a_t \;\; c_t \;\; o_{t+1} \;\; x_{t+1} \;\; \tilde{y}_{t+1} \;\; z_t \;\; W_{t+1} \;\; y_{t+1}. \quad (8)$$
Finally, the target in the extensive sense for y_t is
$$y_t^* = E_{t,\pi}\big[(1 - c_t) \cdot y_{t+1}^* + c_t \cdot z(o_{t+1}, y_{t+1}^*)\big], \quad (9)$$
where · represents component-wise multiplication and π is the policy being followed, which is assumed fixed.
¹ In general, z is a function of all the future predictions and observations, but in this paper we treat only the one-step case.

3 Experiment 1: n-step Unconditional Prediction

In this experiment we sought to predict the observation bit precisely n steps in advance, for n = 1, 2, 5, 10, and 25. In order to predict n steps in advance, of course, we also have to predict n − 1 steps in advance, n − 2 steps in advance, etc., all the way down to predicting one step ahead. This is specified by a TD network consisting of a single chain of predictions like the left column of Figure 1a, but of length 25 rather than 3. Random-walk sequences were constructed by starting at the center state and then taking random actions for 50, 100, 150, and 200 steps (100 sequences each). We applied a TD network and a corresponding Monte Carlo method to this data. The Monte Carlo method learned the same predictions, but learned them by comparing them to the actual outcomes in the sequence (instead of z_t^i in (7)). This involved significant additional complexity to store the predictions until their corresponding targets were available. Both algorithms used feature vectors of 7 binary components, one for each of the seven states, all of which were zero except for the one corresponding to the current state. Both algorithms formed their predictions linearly (σ(·) was the identity) and unconditionally (c_t^i = 1 ∀ i, t). In an initial set of experiments, both algorithms were applied online with a variety of values for their step-size parameter α. Under these conditions we did not find that either algorithm was clearly better in terms of the mean square error in their predictions over the data sets.
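To make the update cycle in equations (5)–(8) concrete, here is a minimal sketch (our own code, with illustrative variable names) of one answer-network update for the unconditional prediction chain used in Experiment 1, assuming the identity σ, one-hot state features, and conditions c_t^i = 1:

```python
import numpy as np

n = 25   # chain of n-step predictions (left column of Figure 1a, length 25)
m = 7    # one-hot feature per random-walk state, as in Experiment 1
W = np.zeros((n, m))

def predict(W, x):
    # Eq. (5) with sigma = identity: y_t = W_t x_t
    return W @ x

def td_network_step(W, x_t, x_tp1, o_tp1_bit, alpha=0.1):
    """One answer-network update for the unconditional chain question network.
    x_t, x_tp1: one-hot state features at times t and t+1;
    o_tp1_bit: the special observation bit seen at t+1."""
    y_t = predict(W, x_t)            # predictions made at time t
    y_tilde_tp1 = predict(W, x_tp1)  # next-step predictions with old weights, eq. (3)

    # Chain targets, eq. (2): node 1 targets the next observation bit,
    # node i > 1 targets node i-1's value at time t+1.
    z_t = np.empty(W.shape[0])
    z_t[0] = o_tp1_bit
    z_t[1:] = y_tilde_tp1[:-1]

    # Eq. (7) with linear predictions and conditions c_t^i = 1:
    # delta W[i, j] = alpha * (z^i - y^i) * x_t[j].
    W += alpha * np.outer(z_t - y_t, x_t)
    return W
```

A Monte Carlo learner would instead hold each prediction until its actual outcome n steps later is available, which is the comparison reported next.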
We found a clearer result when both algorithms were trained using batch updating, in which weight changes are collected “on the side” over an experience sequence and then made all at once at the end, and the whole process is repeated until convergence. Under batch updating, convergence is to the same predictions regardless of initial conditions or α value (as long as α is sufficiently small), which greatly simplifies comparison of algorithms. The predictions learned under batch updating are also the same as would be computed by least squares algorithms such as LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2000; Lagoudakis & Parr, 2003). The errors in the final predictions are shown in Table 1. For 1-step predictions, the Monte-Carlo and TD methods performed identically of course, but for longer predictions a significant difference was observed. The RMSE of the Monte Carlo method increased with prediction length whereas for the TD network it decreased. The largest standard error in any of the numbers shown in the table is 0.008, so almost all of the differences are statistically significant. TD methods appear to have a significant data-efficiency advantage over non-TD methods in this prediction-by-n context (and this task) just as they do in conventional multi-step prediction (Sutton, 1988).

Time Steps   1-step MC/TD   2-step MC   2-step TD   5-step MC   5-step TD   10-step MC   10-step TD   25-step MC   25-step TD
50           0.205          0.219       0.172       0.234       0.159       0.249        0.139        0.297        0.129
100          0.124          0.133       0.100       0.160       0.098       0.168        0.079        0.187        0.068
150          0.089          0.103       0.073       0.121       0.076       0.130        0.063        0.153        0.054
200          0.076          0.084       0.060       0.109       0.065       0.112        0.056        0.118        0.049

Table 1: RMSE of Monte-Carlo and TD-network predictions of various lengths and for increasing amounts of training data on the random-walk example with batch updating.

4 Experiment 2: Action-conditional Prediction

The advantage of TD methods should be greater for predictions that apply only when the experience sequence unfolds in a particular way, such as when a particular sequence of actions are made. In a second experiment we sought to learn n-step-ahead predictions conditional on action selections. The question network for learning all 2-step-ahead predictions is shown in Figure 1b. The upper two nodes predict the observation bit conditional on taking a left action (L) or a right action (R). The lower four nodes correspond to the two-step predictions, e.g., the second lower node is the prediction of what the observation bit will be if an L action is taken followed by an R action. These predictions are the same as the e-tests used in some of the work on predictive state representations (Littman, Sutton & Singh, 2002; Rudary & Singh, 2003). In this experiment we used a question network like that in Figure 1b except of depth four, consisting of 30 (2+4+8+16) nodes. The conditions for each node were set to 0 or 1 depending on whether the action taken on the step matched that indicated in the figure. The feature vectors were as in the previous experiment. Now that we are conditioning on action, the problem is deterministic and α can be set uniformly to 1. A Monte Carlo prediction can be learned only when its corresponding action sequence occurs in its entirety, but then it is complete and accurate in one step. The TD network, on the other hand, can learn from incomplete sequences but must propagate them back one level at a time. First the one-step predictions must be learned, then the two-step predictions from them, and so on. The results for online and batch training are shown in Tables 2 and 3.
As anticipated, the TD network learns much faster than Monte Carlo with both online and batch updating. Because the TD network learns its n-step predictions based on its (n − 1)-step predictions, it has a clear advantage for this task. Once the TD network has seen each action in each state, it can quickly learn any prediction 2, 10, or 1000 steps in the future. Monte Carlo, on the other hand, must sample actual sequences, so each exact action sequence must be observed.

Time Step   1-Step MC/TD   2-Step MC   2-Step TD   3-Step MC   3-Step TD   4-Step MC   4-Step TD
100         0.153          0.222       0.182       0.253       0.195       0.285       0.185
200         0.019          0.092       0.044       0.142       0.054       0.196       0.062
300         0.000          0.040       0.000       0.089       0.013       0.139       0.017
400         0.000          0.019       0.000       0.055       0.000       0.093       0.000
500         0.000          0.019       0.000       0.038       0.000       0.062       0.000

Table 2: RMSE of the action-conditional predictions of various lengths for Monte-Carlo and TD-network methods on the random-walk problem with online updating.

Time Steps   50       100      150     200
MC           53.48%   30.81%   19.26%  11.69%
TD           17.21%   4.50%    1.57%   0.14%

Table 3: Average proportion of incorrect action-conditional predictions for batch-updating versions of Monte-Carlo and TD-network methods, for various amounts of data, on the random-walk task. All differences are statistically significant.

5 Experiment 3: Learning a Predictive State Representation

Experiments 1 and 2 showed advantages for TD learning methods in Markov problems. The feature vectors in both experiments provided complete information about the nominal state of the random walk. In Experiment 3, on the other hand, we applied TD networks to a non-Markov version of the random-walk example, in particular, in which only the special observation bit was visible and not the state number. In this case it is not possible to make accurate predictions based solely on the current action and observation; the previous time step’s predictions must be used as well. As in the previous experiment, we sought to learn n-step predictions using action-conditional question networks of depths 2, 3, and 4. The feature vector x_t consisted of three parts: a constant 1, four binary features to represent the pair of action a_{t−1} and observation bit o_t, and n more features corresponding to the components of y_{t−1}. The feature vectors were thus of length m = 11, 19, and 35 for the three depths. In this experiment, σ(·) was the S-shaped logistic function. The initial weights W_0 and predictions y_0 were both 0. Fifty random-walk sequences were constructed, each of 250,000 time steps, and presented to TD networks of the three depths, with a range of step-size parameters α. We measured the RMSE of all predictions made by the networks (computed from knowledge of the task) and also the “empirical RMSE,” the error in the one-step prediction for the action actually taken on each step. We found that in all cases the errors approached zero over time, showing that the problem was completely solved. Figure 2 shows some representative learning curves for the depth-2 and depth-4 TD networks.

[Figure 2 (plot omitted; y-axis: empirical RMS error, x-axis: time steps up to 250K, curves for step sizes α = .1, .25, .5, .75 and one depth-2 network): Prediction performance on the non-Markov random walk with depth-4 TD networks (and one depth-2 network) with various step-size parameters, averaged over 50 runs and 1000 time-step bins. The “bump” most clearly seen with small step sizes is reliably present and may be due to predictions of different lengths being learned at different times.]
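As a sketch of how the Experiment 3 inputs can be assembled (our own code; the paper gives no implementation, and the action labels are assumed), the feature vector combines a bias term, indicators for the preceding action–observation pair, and the previous predictions:

```python
import numpy as np

def experiment3_features(a_prev, o_bit, y_prev, actions=("L", "R")):
    """Sketch (our notation) of the Experiment 3 feature vector x_t: a constant 1,
    four indicator bits for the (previous action, current observation bit) pair,
    and the previous predictions y_{t-1}."""
    pair = np.zeros(4)
    pair[actions.index(a_prev) * 2 + int(o_bit)] = 1.0
    return np.concatenate(([1.0], pair, y_prev))   # length m = 1 + 4 + n

def logistic(s):
    """S-shaped squashing used in Experiment 3, applied elementwise to W x."""
    return 1.0 / (1.0 + np.exp(-s))

# A depth-2 action-conditional question network has n = 2 + 4 = 6 nodes, so
# m = 11; depths 3 and 4 give m = 19 and m = 35, matching the reported lengths.
```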
In ongoing experiments on other non-Markov problems we have found that TD networks do not always find such complete solutions. Other problems seem to require more than one step of history information (the one-step-preceding action and observation), though less than would be required using history information alone. Our results as a whole suggest that TD networks may provide an effective alternative learning algorithm for predictive state representations (Littman et al., 2000). Previous algorithms have been found to be effective on some tasks but not on others (e.g, Singh et al., 2003; Rudary & Singh, 2004; James & Singh, 2004). More work is needed to assess the range of effectiveness and learning rate of TD methods vis-a-vis previous methods, and to explore their combination with history information. 6 Conclusion TD networks suggest a large set of possibilities for learning to predict, and in this paper we have begun exploring the first few. Our results show that even in a fully observable setting there may be significant advantages to TD methods when learning TD-defined predictions. Our action-conditional results show that TD methods can learn dramatically faster than other methods. TD networks allow the expression of many new kinds of predictions whose extensive semantics is not immediately clear, but which are ultimately fully grounded in data. It may be fruitful to further explore the expressive potential of TD-defined predictions. Although most of our experiments have concerned the representational expressiveness and efficiency of TD-defined predictions, it is also natural to consider using them as state, as in predictive state representations. Our experiments suggest that this is a promising direction and that TD learning algorithms may have advantages over previous learning methods. Finally, we note that adding nodes to a question network produces new predictions and thus may be a way to address the discovery problem for predictive representations. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Satinder Singh, Doina Precup, Michael Littman, Mark Ring, Vadim Bulitko, Eddie Rafols, Anna Koop, Tao Wang, and all the members of the rlai.net group. References Boyan, J. A. (2000). Technical update: Least-squares temporal difference learning. Machine Learning 49:233–246. Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1/2/3):33–57. Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation 5(4):613–624. James, M. and Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 417–424. Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pp. 167–173. Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research 4(Dec):1107–1149. Littman, M. L., Sutton, R. S. and Singh, S. (2002). Predictive representations of state. In Advances In Neural Information Processing Systems 14:1555–1561. Rudary, M. R. and Singh, S. (2004). A nonlinear predictive state representation. In Advances in Neural Information Processing Systems 16:855–862. Singh, S., Littman, M. L., Jong, N. K., Pardoe, D. and Stone, P. 
(2003). Learning predictive state representations. In Proceedings of the Twentieth International Conference on Machine Learning, pp. 712–719. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9–44. Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 531–539. Morgan Kaufmann, San Francisco. Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181–211.

4 0.5875513 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity

Author: Maxim Likhachev, Sebastian Thrun, Geoffrey J. Gordon

Abstract: Planning algorithms designed for deterministic worlds, such as A* search, usually run much faster than algorithms designed for worlds with uncertain action outcomes, such as value iteration. Real-world planning problems often exhibit uncertainty, which forces us to use the slower algorithms to solve them. Many real-world planning problems exhibit sparse uncertainty: there are long sequences of deterministic actions which accomplish tasks like moving sensor platforms into place, interspersed with a small number of sensing actions which have uncertain outcomes. In this paper we describe a new planning algorithm, called MCP (short for MDP Compression Planning), which combines A* search with value iteration for solving Stochastic Shortest Path problem in MDPs with sparse stochasticity. We present experiments which show that MCP can run substantially faster than competing planners in domains with sparse uncertainty; these experiments are based on a simulation of a ground robot cooperating with a helicopter to fill in a partial map and move to a goal location. In deterministic planning problems, optimal paths are acyclic: no state is visited more than once. Because of this property, algorithms like A* search can guarantee that they visit each state in the state space no more than once. By visiting the states in an appropriate order, it is possible to ensure that we know the exact value of all of a state’s possible successors before we visit that state; so, the first time we visit a state we can compute its correct value. By contrast, if actions have uncertain outcomes, optimal paths may contain cycles: some states will be visited two or more times with positive probability. Because of these cycles, there is no way to order states so that we determine the values of a state’s successors before we visit the state itself. Instead, the only way to compute state values is to solve a set of simultaneous equations. In problems with sparse stochasticity, only a small fraction of all states have uncertain outcomes. It is these few states that cause all of the cycles: while a deterministic state s may participate in a cycle, the only way it can do so is if one of its successors has an action with a stochastic outcome (and only if this stochastic action can lead to a predecessor of s). In such problems, we would like to build a smaller MDP which contains only states which are related to stochastic actions. We will call such an MDP a compressed MDP, and we will call its states distinguished states. We could then run fast algorithms like A* search to plan paths between distinguished states, and reserve slower algorithms like value iteration for deciding how to deal with stochastic outcomes. (a) Segbot (d) Planning map (b) Robotic helicopter (e) Execution simulation (c) 3D Map Figure 1: Robot-Helicopter Coordination There are two problems with such a strategy. First, there can be a large number of states which are related to stochastic actions, and so it may be impractical to enumerate all of them and make them all distinguished states; we would prefer instead to distinguish only states which are likely to be encountered while executing some policy which we are considering. Second, there can be a large number of ways to get from one distinguished state to another: edges in the compressed MDP correspond to sequences of actions in the original MDP. 
If we knew the values of all of the distinguished states exactly, then we could use A* search to generate optimal paths between them, but since we do not we cannot. In this paper, we will describe an algorithm which incrementally builds a compressed MDP using a sequence of deterministic searches. It adds states and edges to the compressed MDP only by encountering them along trajectories; so, it never adds irrelevant states or edges to the compressed MDP. Trajectories are generated by deterministic search, and so undistinguished states are treated only with fast algorithms. Bellman errors in the values for distinguished states show us where to try additional trajectories, and help us build the relevant parts of the compressed MDP as quickly as possible. 1 Robot-Helicopter Coordination Problem The motivation for our research was the problem of coordinating a ground robot and a helicopter. The ground robot needs to plan a path from its current location to a goal, but has only partial knowledge of the surrounding terrain. The helicopter can aid the ground robot by flying to and sensing places in the map. Figure 1(a) shows our ground robot, a converted Segway with a SICK laser rangefinder. Figure 1(b) shows the helicopter, also with a SICK. Figure 1(c) shows a 3D map of the environment in which the robot operates. The 3D map is post-processed to produce a discretized 2D environment (Figure 1(d)). Several places in the map are unknown, either because the robot has not visited them or because their status may have changed (e.g, a car may occupy a driveway). Such places are shown in Figure 1(d) as white squares. The elevation of each white square is proportional to the probability that there is an obstacle there; we assume independence between unknown squares. The robot must take the unknown locations into account when planning for its route. It may plan a path through these locations, but it risks having to turn back if its way is blocked. Alternately, the robot can ask the helicopter to fly to any of these places and sense them. We assign a cost to running the robot, and a somewhat higher cost to running the helicopter. The planning task is to minimize the expected overall cost of running the robot and the helicopter while getting the robot to its destination and the helicopter back to its home base. Figure 1(e) shows a snapshot of the robot and helicopter executing a policy. Designing a good policy for the robot and helicopter is a POMDP planning problem; unfortunately POMDPs are in general difficult to solve (PSPACE-complete [7]). In the POMDP representation, a state is the position of the robot, the current location of the helicopter (a point on a line segment from one of the unknown places to another unknown place or the home base), and the true status of each unknown location. The positions of the robot and the helicopter are observable, so that the only hidden variables are whether each unknown place is occupied. The number of states is (# of robot locations)×(# of helicopter locations)×2# of unknown places . So, the number of states is exponential in the number of unknown places and therefore quickly becomes very large. We approach the problem by planning in the belief state space, that is, the space of probability distributions over world states. This problem is a continuous-state MDP; in this belief MDP, our state consists of the ground robot’s location, the helicopter’s location, and a probability of occupancy for each unknown location. 
We will discretize the continuous probability variables by breaking the interval [0, 1] into several chunks; so, the number of belief states is exponential in the number of unknown places, and classical algorithms such as value iteration are infeasible even on small problems. If sensors are perfect, this domain is acyclic: after we sense a square we know its true state forever after. On the other hand, imperfect sensors can lead to cycles: new sensor data can contradict older sensor data and lead to increased uncertainty. With or without sensor noise, our belief state MDP differs from general MDPs because its stochastic transitions are sparse: large portions of the policy (while the robot and helicopter are traveling between unknown locations) are deterministic. The algorithm we propose in this paper takes advantage of this property of the problem, as we explain in the next section.

2 The Algorithm

Our algorithm can be broken into two levels. At a high level, it constructs a compressed MDP, denoted M^c, which contains only the start, the goal, and some states which are outcomes of stochastic actions. At a lower level, it repeatedly runs deterministic searches to find new information to put into M^c. This information includes newly-discovered stochastic actions and their outcomes; better deterministic paths from one place to another; and more accurate value estimates similar to Bellman backups. The deterministic searches can use an admissible heuristic h to focus their effort, so we can often avoid putting many irrelevant actions into M^c. Because M^c will often be much smaller than M, we can afford to run stochastic planning algorithms like value iteration on it. On the other hand, the information we get by planning in M^c will improve the heuristic values that we use in our deterministic searches; so, the deterministic searches will tend to visit only relevant portions of the state space.

2.1 Constructing and Solving a Compressed MDP

Each action in the compressed MDP represents several consecutive actions in M: if we see a sequence of states and actions s_1, a_1, s_2, a_2, . . . , s_k, a_k where a_1 through a_{k−1} are deterministic but a_k is stochastic, then we can represent it in M^c with a single action a, available at s_1, whose outcome distribution is P(s' | s_k, a_k) and whose cost is
$$c(s_1, a, s') = \sum_{i=1}^{k-1} c(s_i, a_i, s_{i+1}) + c(s_k, a_k, s').$$
(See Figure 2(a) for an example of such a compressed action.) In addition, if we see a sequence of deterministic actions ending in s_goal, say s_1, a_1, s_2, a_2, . . . , s_k, a_k, s_{k+1} = s_goal, we can define a compressed action which goes from s_1 to s_goal at cost $\sum_{i=1}^{k} c(s_i, a_i, s_{i+1})$. We can label each compressed action that starts at s with (s, s', a) (where a = null if s' = s_goal). Among all compressed actions starting at s and ending at (s', a) there is (at least) one with lowest expected cost; we will call such an action an optimal compression of (s, s', a). Write A_stoch for the set of all pairs (s, a) such that action a when taken from state s has more than one possible outcome, and include as well (s_goal, null). Write S_stoch for the states which are possible outcomes of the actions in A_stoch, and include s_start as well. If we include in our compressed MDP an optimal compression of (s, s', a) for every s ∈ S_stoch and every (s', a) ∈ A_stoch, the result is what we call the full compressed MDP; an example is in Figure 2(b).
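A minimal sketch of the action-compression step just described, assuming a simple dictionary representation of outcomes (the data layout and names are ours, not the paper's):

```python
# Sketch (our own data layout, not the paper's) of action compression: a
# deterministic prefix s1,a1,...,sk followed by one stochastic action ak
# collapses into a single compressed action available at s1.

def compress(prefix, stochastic_outcomes):
    """prefix: [(s_i, a_i, cost_i), ...] -- the deterministic run ending at s_k.
    stochastic_outcomes: {s_prime: (prob, cost)} for the final stochastic
    action a_k taken at s_k. Returns the compressed action's outcome model,
    with the prefix cost folded into every outcome's cost."""
    prefix_cost = sum(c for (_, _, c) in prefix)
    return {s_prime: (prob, prefix_cost + cost)
            for s_prime, (prob, cost) in stochastic_outcomes.items()}

# Example: two unit-cost moves followed by a sensing action with two outcomes.
outcomes = compress([("s1", "a1", 1.0), ("s2", "a2", 1.0)],
                    {"s_clear": (0.7, 0.5), "s_blocked": (0.3, 0.5)})
```

The deterministic prefix contributes a single summed cost, so the compressed MDP only needs to branch at the final stochastic action.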
If we solve the full compressed MDP, the value of each state will be the same as the value of the corresponding state in M. However, we do not need to do that much work: many states and actions in the full compressed MDP are irrelevant since they do not appear in the optimal policy from s_start to s_goal. So, the goal of the MCP algorithm will be to construct only the relevant part of the compressed MDP by building M^c incrementally. Figure 2(c) shows the incremental construction of a compressed MDP which contains all of the stochastic states and actions along an optimal policy in M.

[Figure 2 (diagrams omitted): MDP compression. (a) action compression, (b) full MDP compression, (c) incremental MDP compression.]

The pseudocode for MCP is given in Figure 3.

Main()
01 initialize M^c with s_start and s_goal and set their v-values to 0;
02 while (∃ s ∈ M^c s.t. RHS(s) − v(s) > δ and s belongs to the current greedy policy)
03   select s_pivot to be any such state s;
04   [v; v_lim] = Search(s_pivot);
05   v(s_pivot) = v;
06   set the cost c(s_pivot, ā, s_goal) of the limit action ā from s_pivot to v_lim;
07   optionally run some algorithm satisfying req. A for a bounded amount of time to improve the value function in M^c;

Figure 3: MCP main loop

It begins by initializing M^c to contain only s_start and s_goal, and it sets v(s_start) = v(s_goal) = 0. It maintains the invariant that 0 ≤ v(s) ≤ v*(s) for all s. On each iteration, MCP looks at the Bellman error of each of the states in M^c. The Bellman error is v(s) − RHS(s), where
$$\mathrm{RHS}(s) = \min_{a \in A(s)} \mathrm{RHS}(s, a), \qquad \mathrm{RHS}(s, a) = E_{s' \in \mathrm{succ}(s,a)}\big[c(s, a, s') + v(s')\big].$$
By convention the min of an empty set is ∞, so an s which does not have any compressed actions yet is considered to have infinite RHS. MCP selects a state with negative Bellman error, s_pivot, and starts a search at that state. (We note that there exist many possible ways to select s_pivot; for example, we can choose the state with the largest negative Bellman error, or the largest error when weighted by state visitation probabilities in the best policy in M^c.) The goal of this search is to find a new compressed action a such that its RHS-value can provide a new lower bound on v*(s_pivot). This action can either decrease the current RHS(s_pivot) (if a seems to be a better action in terms of the current v-values of action outcomes) or prove that the current RHS(s_pivot) is valid. Since v(s') ≤ v*(s'), one way to guarantee that RHS(s_pivot, a) ≤ v*(s_pivot) is to compute an optimal compression of (s_pivot, s, a) for all s, a, then choose the one with the smallest RHS. A more sophisticated strategy is to use an A* search with appropriate safeguards to make sure we never overestimate the value of a stochastic action. MCP, however, uses a modified A* search which we will describe in the next section. As the search finds new compressed actions, it adds them and their outcomes to M^c. It is allowed to initialize newly-added states to any admissible values. When the search returns, MCP sets v(s_pivot) to the returned value. This value is at least as large as RHS(s_pivot). Consequently, Bellman error for s_pivot becomes non-negative. In addition to the compressed action and the updated value, the search algorithm returns a “limit value” v_lim(s_pivot). These limit values allow MCP to run a standard MDP planning algorithm on M^c to improve its v(s) estimates.
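To illustrate the bookkeeping in the main loop above, here is a sketch (our own representation, reusing the outcome dictionaries from the previous sketch) of the RHS computation and the pivot-selection test on line 02:

```python
import math

# Sketch (ours) of the main-loop bookkeeping. M_c maps each state to its
# compressed actions; each compressed action maps outcome states s' to
# (probability, cost) pairs, as in the compression sketch earlier.

def rhs(s, M_c, v):
    """RHS(s) = min over compressed actions of E_{s'}[c(s,a,s') + v(s')];
    infinity if s has no compressed actions yet (min over an empty set)."""
    best = math.inf
    for outcomes in M_c.get(s, {}).values():
        q = sum(p * (c + v.get(s_prime, 0.0))
                for s_prime, (p, c) in outcomes.items())
        best = min(best, q)
    return best

def pick_pivot(greedy_states, M_c, v, delta):
    """Return a state on the current greedy policy whose Bellman error
    v(s) - RHS(s) is below -delta (i.e., RHS(s) - v(s) > delta), if any."""
    worst, pivot = -delta, None
    for s in greedy_states:
        err = v.get(s, 0.0) - rhs(s, M_c, v)
        if err < worst:
            worst, pivot = err, s
    return pivot
```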
MCP can use any planning algorithm which guarantees that, for any s, it will not lower v(s) and will not increase v(s) beyond the smaller of v_lim(s) and RHS(s) (Requirement A). For example, we could insert a fake “limit action” into M^c which goes directly from s_pivot to s_goal at cost v_lim(s_pivot) (as we do on line 06 in Figure 3), then run value iteration for a fixed amount of time, selecting for each backup a state with negative Bellman error. After updating M^c from the result of the search and any optional planning, MCP begins again by looking for another state with a negative Bellman error. It repeats this process until there are no negative Bellman errors larger than δ. For small enough δ, this property guarantees that we will be able to find a good policy (see section 2.3).

2.2 Searching the MDP Efficiently

The top level algorithm (Figure 3) repeatedly invokes a search method for finding trajectories from s_pivot to s_goal. In order for the overall algorithm to work correctly, there are several properties that the search must satisfy. First, the estimate v that search returns for the expected cost of s_pivot should always be admissible. That is, 0 ≤ v ≤ v*(s_pivot) (Property 1). Second, the estimate v should be no less than the one-step lookahead value of s_pivot in M^c. That is, v ≥ RHS(s_pivot) (Property 2). This property ensures that search either increases the value of s_pivot or finds additional (or improved) compressed actions. The third and final property is for the v_lim value, and it is only important if MCP uses its optional planning step (line 07). The property is that v ≤ v_lim ≤ v̄*(s_pivot) (Property 3). Here v̄*(s_pivot) denotes the minimum expected cost of starting at s_pivot, picking a compressed action not in M^c, and acting optimally from then on. (Note that v̄* can be larger than v* if the optimal compressed action is already part of M^c.) Property 3 uses v̄* rather than v* since the latter is not known while it is possible to compute a lower bound on the former efficiently (see below). One could adapt A* search to satisfy at least Properties 1 and 2 by assuming that we can control the outcome of stochastic actions. However, this sort of search is highly optimistic and can bias the search towards improbable trajectories. Also, it can only use heuristics which are even more optimistic than it is: that is, h must be admissible with respect to the optimistic assumption of controlled outcomes. We therefore present a version of A*, called MCP-search (Figure 4), that is more efficient for our purposes. MCP-search finds the correct expected value for the first stochastic action it encounters on any given trajectory, and is therefore far less optimistic. And, MCP-search only requires heuristic values to be admissible with respect to the v̄* values, h(s) ≤ v̄*(s). Finally, MCP-search speeds up repetitive searches by improving heuristic values based on previous searches. A* maintains a priority queue, OPEN, of states which it plans to expand. The OPEN queue is sorted by f(s) = g(s) + h(s), so that A* always expands next a state which appears to be on the shortest path from start to goal. During each expansion a state s is removed from OPEN and all the g-values of s’s successors are updated; if g(s') is decreased for some state s', A* inserts s' into OPEN. A* terminates as soon as the goal state is expanded. We use the variant of A* with pathmax [5] to use efficiently heuristics that do not satisfy the triangle inequality.
MCP is similar to A*, but the OPEN list can also contain state-action pairs {s, a} where a is a stochastic action (line 31). Plain states are represented in OPEN as {s, null}.

ImproveHeuristic(s)
01 if s ∈ M^c then h(s) = max(h(s), v(s));
02 improve heuristic h(s) further if possible using f_best and g(s) from previous iterations;

procedure fvalue({s, a})
03 if s = null return ∞;
04 else if a = null return g(s) + h(s);
05 else return g(s) + max(h(s), E_{s'∈Succ(s,a)}{c(s, a, s') + h(s')});

CheckInitialize(s)
06 if s was accessed last in some previous search iteration
07   ImproveHeuristic(s);
08 if s was not yet initialized in the current search iteration
09   g(s) = ∞;

InsertUpdateCompAction(s_pivot, s, a)
10 reconstruct the path from s_pivot to s;
11 insert compressed action (s_pivot, s, a) into A(s_pivot) (or update the cost if a cheaper path was found)
12 for each outcome u of a that was not in M^c previously
13   set v(u) to h(u) or any other value less than or equal to v*(u);
14   set the cost c(u, ā, s_goal) of the limit action ā from u to v(u);

procedure Search(s_pivot)
15 CheckInitialize(s_goal), CheckInitialize(s_pivot);
16 g(s_pivot) = 0;
17 OPEN = {{s_pivot, null}};
18 {s_best, a_best} = {null, null}, f_best = ∞;
19 while (g(s_goal) > min_{{s,a}∈OPEN} fvalue({s, a}) AND f_best + θ > min_{{s,a}∈OPEN} fvalue({s, a}))
20   remove {s, a} with the smallest fvalue({s, a}) from OPEN, breaking ties towards the pairs with a = null;
21   if a = null  // expand state s
22     for each s' ∈ Succ(s)
23       CheckInitialize(s');
24     for each deterministic a' ∈ A(s)
25       s' = Succ(s, a');
26       h(s') = max(h(s'), h(s) − c(s, a', s'));
27       if g(s') > g(s) + c(s, a', s')
28         g(s') = g(s) + c(s, a', s');
29         insert/update {s', null} into OPEN with fvalue({s', null});
30     for each stochastic a' ∈ A(s)
31       insert/update {s, a'} into OPEN with fvalue({s, a'});
32   else  // encode stochastic action a from state s as a compressed action from s_pivot
33     InsertUpdateCompAction(s_pivot, s, a);
34     if f_best > fvalue({s, a}) then {s_best, a_best} = {s, a}, f_best = fvalue({s, a});
35 if (g(s_goal) ≤ min_{{s,a}∈OPEN} fvalue({s, a}) AND OPEN ≠ ∅)
36   reconstruct the path from s_pivot to s_goal;
37   update/insert into A(s_pivot) a deterministic action a leading to s_goal;
38   if f_best ≥ g(s_goal) then {s_best, a_best} = {s_goal, null}, f_best = g(s_goal);
39 return [f_best; min_{{s,a}∈OPEN} fvalue({s, a})];

Figure 4: MCP-search Algorithm

Just like A*, MCP-search expands elements in the order of increasing f-values, but it breaks ties towards elements encoding plain states (line 20). The f-value of {s, a} is defined as g(s) + max[h(s), E_{s'∈Succ(s,a)}(c(s, a, s') + h(s'))] (line 05). This f-value is a lower bound on the cost of a policy that goes from s_start to s_goal by first executing a series of deterministic actions until action a is executed from state s. This bound is as tight as possible given our heuristic values. State expansion (lines 21–31) is very similar to A*. When the search removes from OPEN a state-action pair {s, a} with a ≠ null, it adds a compressed action to M^c (line 33). It also adds a compressed action if there is an optimal deterministic path to s_goal (line 37). f_best tracks the minimum f-value of all the compressed actions found. As a result, f_best ≤ v*(s_pivot) and is used as a new estimate for v(s_pivot). The limit value v_lim(s_pivot) is obtained by continuing the search until the minimum f-value of elements in OPEN approaches f_best + θ for some θ ≥ 0 (line 19).
This minimum f-value then provides a lower bound on v̄*(s_pivot). To speed up repetitive searches, MCP-search improves the heuristic of every state that it encounters for the first time in the current search iteration (lines 01 and 02). Line 01 uses the fact that v(s) from M^c is a lower bound on v̄*(s). Line 02 uses the fact that f_best − g(s) is a lower bound on v̄*(s) at the end of each previous call to Search; for more details see [4].

2.3 Theoretical Properties of the Algorithm

We now present several theorems about our algorithm. The proofs of these and other theorems can be found in [4]. The first theorem states the main properties of MCP-search.

Theorem 1 The search function terminates and the following holds for the values it returns:
(a) if s_best ≠ null then v*(s_pivot) ≥ f_best ≥ E{c(s_pivot, a_best, s') + v(s')}
(b) if s_best = null then v*(s_pivot) = f_best = ∞
(c) f_best ≤ min_{{s,a}∈OPEN} fvalue({s, a}) ≤ v̄*(s_pivot).

If neither s_goal nor any state-action pairs were expanded, then s_best = null and (b) says that there is no policy from s_pivot that has a finite expected cost. Using the above theorem it is easy to show that MCP-search satisfies Properties 1, 2 and 3, considering that f_best is returned as variable v and min_{{s,a}∈OPEN} fvalue({s, a}) is returned as variable v_lim in the main loop of the MCP algorithm (Figure 3). Property 1 follows directly from (a) and (b) and the fact that costs are strictly positive and v-values are non-negative. Property 2 also follows trivially from (a) and (b). Property 3 follows from (c). Given these properties the next theorem states the correctness of the outer MCP algorithm (in the theorem π^c_greedy denotes a greedy policy that always chooses an action that looks best based on its cost and the v-values of its immediate successors).

Theorem 2 Given a deterministic search algorithm which satisfies Properties 1–3, the MCP algorithm will terminate. Upon termination, for every state s ∈ M^c ∩ π^c_greedy we have RHS(s) − δ ≤ v(s) ≤ v*(s).

Given the above theorem one can show that for 0 ≤ δ < c_min (where c_min is the smallest expected action cost in our MDP) the expected cost of executing π^c_greedy from s_start is at most (c_min / (c_min − δ)) v*(s_start). Picking δ ≥ c_min is not guaranteed to result in a proper policy, even though Theorem 2 continues to hold.

3 Experimental Study

We have evaluated the MCP algorithm on the robot-helicopter coordination problem described in section 1. To obtain an admissible heuristic, we first compute a value function for every possible configuration of obstacles. Then we weight the value functions by the probabilities of their obstacle configurations, sum them, and add the cost of moving the helicopter back to its base if it is not already there. This procedure results in optimistic cost estimates because it pretends that the robot will find out the obstacle locations immediately instead of having to wait to observe them. The results of our experiments are shown in Figure 5. We have compared MCP against three algorithms: RTDP [1], LAO* [2] and value iteration on reachable states (VI). RTDP can cope with large size MDPs by focussing its planning efforts along simulated execution trajectories. LAO* uses heuristics to prune away irrelevant states, then repeatedly performs dynamic programming on the states in its current partial policy. We have implemented LAO* so that it reduces to AO* [6] when environments are acyclic (e.g., the robot-helicopter problem with perfect sensing).
VI was only able to run on the problems with perfect sensing since the number of reachable states was too large for the others. The results support the claim that MCP can solve large problems with sparse stochasticity. For the problem with perfect sensing, on average MCP was able to plan 9.5 times faster than LAO*, 7.5 times faster than RTDP, and 8.5 times faster than VI. On average for these problems, MCP computed values for 58633 states while M c grew to 396 states, and MCP encountered 3740 stochastic transitions (to give a sense of the degree of stochasticity). The main cost of MCP was in its deterministic search subroutine; this fact suggests that we might benefit from anytime search techniques such as ARA* [3]. The results for the problems with imperfect sensing show that, as the number and density of uncertain outcomes increases, the advantage of MCP decreases. For these problems MCP was able to solve environments 10.2 times faster than LAO* but only 2.2 times faster than RTDP. On average MCP computed values for 127,442 states, while the size of M c was 3,713 states, and 24,052 stochastic transitions were encountered. Figure 5: Experimental results. The top row: the robot-helicopter coordination problem with perfect sensors. The bottom row: the robot-helicopter coordination problem with sensor noise. Left column: running times (in secs) for each algorithm grouped by environments. Middle column: the number of backups for each algorithm grouped by environments. Right column: an estimate of the expected cost of an optimal policy (v(sstart )) vs. running time (in secs) for experiment (k) in the top row and experiment (e) in the bottom row. Algorithms in the bar plots (left to right): MCP, LAO*, RTDP and VI (VI is only shown for problems with perfect sensing). The characteristics of the environments are given in the second and third rows under each of the bar plot. The second row indicates how many cells the 2D plane is discretized into, and the third row indicates the number of initially unknown cells in the environment. 4 Discussion The MCP algorithm incrementally builds a compressed MDP using a sequence of deterministic searches. Our experimental results suggest that MCP is advantageous for problems with sparse stochasticity. In particular, MCP has allowed us to scale to larger environments than were otherwise possible for the robot-helicopter coordination problem. Acknowledgements This research was supported by DARPA’s MARS program. All conclusions are our own. References [1] S. Bradtke A. Barto and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995. [2] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35–62, 2001. [3] M. Likhachev, G. Gordon, and S. Thrun. ARA*: Anytime A* with provable bounds on sub-optimality. In Advances in Neural Information Processing Systems (NIPS) 16. Cambridge, MA: MIT Press, 2003. [4] M. Likhachev, G. Gordon, and S. Thrun. MCP: Formal analysis. Technical report, Carnegie Mellon University, Pittsburgh, PA, 2004. [5] L. Mero. A heuristic search algorithm with modifiable estimate. Artificial Intelligence, 23:13–27, 1984. [6] N. Nilsson. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing, 1980. [7] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processses. Mathematics of Operations Research, 12(3):441–450, 1987.

5 0.53306949 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs

Author: Pascal Poupart, Craig Boutilier

Abstract: Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

6 0.51827788 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography

7 0.43104595 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

8 0.41761738 39 nips-2004-Coarticulation in Markov Decision Processes

9 0.39280951 33 nips-2004-Brain Inspired Reinforcement Learning

10 0.35819918 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

11 0.34238696 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

12 0.33827552 64 nips-2004-Experts in a Markov Decision Process

13 0.33131409 74 nips-2004-Harmonising Chorales by Probabilistic Inference

14 0.32329208 108 nips-2004-Markov Networks for Detecting Overlapping Elements in Sequence Data

15 0.31648308 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces

16 0.30725965 130 nips-2004-Newscast EM

17 0.29695266 50 nips-2004-Dependent Gaussian Processes

18 0.29444727 124 nips-2004-Multiple Alignment of Continuous Time Series

19 0.29359958 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

20 0.29234266 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.308), (13, 0.057), (15, 0.14), (26, 0.052), (31, 0.035), (33, 0.162), (35, 0.023), (39, 0.026), (50, 0.037), (65, 0.014), (71, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81161457 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei

Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.

same-paper 2 0.79945755 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

Author: Michael P. Holmes, Charles Jr.

Abstract: Schema learning is a way to discover probabilistic, constructivist, predictive action models (schemas) from experience. It includes methods for finding and using hidden state to make predictions more accurate. We extend the original schema mechanism [1] to handle arbitrary discrete-valued sensors, improve the original learning criteria to handle POMDP domains, and better maintain hidden state by using schema predictions. These extensions show large improvement over the original schema mechanism in several rewardless POMDPs, and achieve very low prediction error in a difficult speech modeling task. Further, we compare extended schema learning to the recently introduced predictive state representations [2], and find their predictions of next-step action effects to be approximately equal in accuracy. This work lays the foundation for a schema-based system of integrated learning and planning. 1

3 0.73829383 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

Author: Margarita Osadchy, Matthew L. Miller, Yann L. Cun

Abstract: We describe a novel method for real-time, simultaneous multi-view face detection and facial pose estimation. The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and non-face images to points far from that manifold. This network is trained by optimizing a loss function of three variables: image, pose, and face/non-face label. We test the resulting system, in a single configuration, on three standard data sets – one for frontal pose, one for rotated faces, and one for profiles – and find that its performance on each set is comparable to previous multi-view face detectors that can only handle one form of pose variation. We also show experimentally that the system’s accuracy on both face detection and pose estimation is improved by training for the two tasks together.

4 0.5985294 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: In this paper we propose a probabilistic model for online document clustering. We use non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichletmultinomial distribution. We use empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature. 1

5 0.59693956 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

Author: Ruei-sung Lin, David A. Ross, Jongwoo Lim, Ming-Hsuan Yang

Abstract: This paper presents an adaptive discriminative generative model that generalizes the conventional Fisher Linear Discriminant algorithm and renders a proper probabilistic interpretation. Within the context of object tracking, we aim to find a discriminative generative model that best separates the target from the background. We present a computationally efficient algorithm to constantly update this discriminative model as time progresses. While most tracking algorithms operate on the premise that the object appearance or ambient lighting condition does not significantly change as time progresses, our method adapts a discriminative generative model to reflect appearance variation of the target and background, thereby facilitating the tracking task in ever-changing environments. Numerous experiments show that our method is able to learn a discriminative generative model for tracking target objects undergoing large pose and lighting changes.

6 0.59405577 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

7 0.59252667 178 nips-2004-Support Vector Classification with Input Data Uncertainty

8 0.59163249 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

9 0.59157717 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

10 0.59137064 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

11 0.59125668 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

12 0.5881654 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

13 0.58812493 161 nips-2004-Self-Tuning Spectral Clustering

14 0.58810127 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

15 0.58776748 69 nips-2004-Fast Rates to Bayes for Kernel Machines

16 0.5873819 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

17 0.58712584 131 nips-2004-Non-Local Manifold Tangent Learning

18 0.58675092 68 nips-2004-Face Detection --- Efficient and Rank Deficient

19 0.58572781 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

20 0.58563727 77 nips-2004-Hierarchical Clustering of a Mixture Model