
Temporal-Difference Networks



Author: Richard S. Sutton, Brian Tanner

Abstract: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the inter-predictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms.

Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system.

The essential idea of TD learning can be described as “learning a guess from a guess”. In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the first.

To be more precise, we must make a distinction between the extensive definition of a prediction, expressing its desired relationship to measurable data, and its TD definition, expressing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future rewards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It’s the same prediction, just defined or expressed in different ways. In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics.
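To make the distinction concrete in the familiar value-function case (a standard restatement, added here only for illustration; $r_{t+1}$ denotes the reward and $\gamma$ the discount factor), the two definitions of the state value $V^{\pi}(s)$ are:

    $V^{\pi}(s) = E_{\pi}\big[\textstyle\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s\big]$   (extensive definition)

    $V^{\pi}(s) = E_{\pi}\big[r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\big]$   (TD definition, i.e., the Bellman equation)

Both characterize the same quantity; the first relates it to data, the second relates it to another prediction of the same kind one step later.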
1 The Learning-to-predict Problem

The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps $t$, the environment generates an observation $o_t \in O$, and the agent takes an action $a_t \in A$. Whereas $A$ is an arbitrary discrete set, we assume without loss of generality that $o_t$ can be represented as a vector of bits. The action and observation events occur in sequence, $o_1, a_1, o_2, a_2, o_3, \ldots$, with each event of course dependent only on those preceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section.

In this paper we use a random-walk problem with seven states, with left and right actions available in every state:

[Diagram: the seven states, numbered 1 through 7, in a row; the special observation bit is 1 at the two end states and 0 at states 2 through 6.]

The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the first two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends deterministically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain.

The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some number of discrete steps in advance, first unconditionally and then conditioned on action sequences.

2 TD Networks

A TD network is a network of nodes, each representing a single scalar prediction. The nodes are interconnected by links representing the TD relationships among the predictions and to the observations and actions. These links determine the extensive semantics of each prediction—its desired or target relationship to the data. They represent what we seek to predict about the data as opposed to how we try to predict it. We think of these links as determining a set of questions being asked about the data, and accordingly we call them the question network. A separate set of interconnections determines the actual computational process—the updating of the predictions at each node from their previous values and the current action and observation. We think of this process as providing the answers to the questions, and accordingly we call them the answer network. The question network provides targets for a learning process shaping the answer network and does not otherwise affect the behavior of the TD network. It is natural to consider changing the question network, but in this paper we take it as fixed and given.

Figure 1a shows a suggestive example of a question network. The three squares across the top represent three observation bits. The node labeled 1 is directly connected to the first observation bit and represents a prediction that that bit will be 1 on the next time step. The node labeled 2 is similarly a prediction of the expected value of node 1 on the next step.
Thus the extensive definition of Node 2’s prediction is the probability that the first observation bit will be 1 two time steps from now. Node 3 similarly predicts the first observation bit three time steps in the future. Node 4 is a conventional TD prediction, in this case of the future discounted sum of the second observation bit, with discount parameter $\gamma$. Its target is the familiar TD target, the data bit plus the node’s own prediction on the next time step (with weightings $1-\gamma$ and $\gamma$ respectively). Nodes 5 and 6 predict the probability of the third observation bit being 1 if particular actions $a$ or $b$ are taken respectively. Node 7 is a prediction of the average of the first observation bit and Node 4’s prediction, both on the next step. This is the first case where it is not easy to see or state the extensive semantics of the prediction in terms of the data. Node 8 predicts another average, this time of nodes 4 and 5, and the question it asks is even harder to express extensively. One could continue in this way, adding more and more nodes whose extensive definitions are difficult to express but which would nevertheless be completely defined as long as these local TD relationships are clear. The thinner links shown entering some nodes are meant to be a suggestion of the entirely separate answer network determining the actual computation (as opposed to the goals) of the network. In this paper we consider only simple question networks such as the left column of Figure 1a and of the action-conditional tree form shown in Figure 1b.

[Figure 1 appears here: panel (a) shows eight numbered nodes beneath three observation-bit squares, with links labeled 1−γ, γ, a, and b; panel (b) shows a two-level tree whose links are labeled L and R.]

Figure 1: The question networks of two TD networks. (a) A question network discussed in the text, and (b) a depth-2 fully-action-conditional question network used in Experiments 2 and 3. Observation bits are represented as squares across the top while actual nodes of the TD network, each corresponding to a separate prediction, are below. The thick lines represent the question network and the thin lines in (a) suggest the answer network (the bulk of which is not shown). Note that all of these nodes, arrows, and numbers are completely different and separate from those representing the random-walk problem on the preceding page.

More formally and generally, let $y_t^i \in [0,1]$, $i = 1, \ldots, n$, denote the prediction of the $i$th node at time step $t$. The column vector of predictions $y_t = (y_t^1, \ldots, y_t^n)^T$ is updated according to a vector-valued function $u$ with modifiable parameter $W$:

    $y_t = u(y_{t-1}, a_{t-1}, o_t, W_t) \in \mathbb{R}^n.$   (1)

The update function $u$ corresponds to the answer network, with $W$ being the weights on its links. Before detailing that process, we turn to the question network, the defining TD relationships between nodes. The TD target $z_t^i$ for $y_t^i$ is an arbitrary function $z^i$ of the successive predictions and observations. In vector form we have¹

    $z_t = z(o_{t+1}, \tilde{y}_{t+1}) \in \mathbb{R}^n,$   (2)

where $\tilde{y}_{t+1}$ is just like $y_{t+1}$, as in (1), except calculated with the old weights before they are updated on the basis of $z_t$:

    $\tilde{y}_t = u(y_{t-1}, a_{t-1}, o_t, W_{t-1}) \in \mathbb{R}^n.$   (3)

(This temporal subtlety also arises in conventional TD learning.) For example, for the nodes in Figure 1a we have $z_t^1 = o_{t+1}^1$, $z_t^2 = y_{t+1}^1$, $z_t^3 = y_{t+1}^2$, $z_t^4 = (1-\gamma)o_{t+1}^2 + \gamma y_{t+1}^4$, $z_t^5 = z_t^6 = o_{t+1}^3$, $z_t^7 = \frac{1}{2}o_{t+1}^1 + \frac{1}{2}y_{t+1}^4$, and $z_t^8 = \frac{1}{2}y_{t+1}^4 + \frac{1}{2}y_{t+1}^5$. The target functions $z^i$ are only part of specifying the question network.
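These targets are simple enough to write down directly. The following sketch (ours, not from the paper; the function name and argument conventions are hypothetical) encodes the target functions of Figure 1a as a single function of the next observation and the next pre-update predictions:

    import numpy as np

    def z_figure_1a(o_next, y_next, gamma=0.9):
        # o_next: next observation bits (o^1, o^2, o^3); y_next: next-step
        # predictions computed with the old weights; gamma: discount for node 4.
        z = np.empty(8)
        z[0] = o_next[0]                                    # node 1: next value of bit 1
        z[1] = y_next[0]                                    # node 2: node 1's next prediction
        z[2] = y_next[1]                                    # node 3: node 2's next prediction
        z[3] = (1 - gamma) * o_next[1] + gamma * y_next[3]  # node 4: conventional TD target
        z[4] = o_next[2]                                    # node 5: bit 3 (if action a; see c_t below)
        z[5] = o_next[2]                                    # node 6: bit 3 (if action b)
        z[6] = 0.5 * o_next[0] + 0.5 * y_next[3]            # node 7: average of bit 1 and node 4
        z[7] = 0.5 * y_next[3] + 0.5 * y_next[4]            # node 8: average of nodes 4 and 5
        return z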
The other part of specifying the question network has to do with making the targets potentially conditional on action and observation. For example, Node 5 in Figure 1a predicts what the third observation bit will be if action $a$ is taken. To arrange for such semantics we introduce a new vector $c_t$ of conditions, $c_t^i$, indicating the extent to which $y_t^i$ is held responsible for matching $z_t^i$, thus making the $i$th prediction conditional on $c_t^i$. Each $c_t^i$ is determined as an arbitrary function $c^i$ of $a_t$ and $y_t$. In vector form we have:

    $c_t = c(a_t, y_t) \in [0,1]^n.$   (4)

For example, for Node 5 in Figure 1a, $c_t^5 = 1$ if $a_t = a$, otherwise $c_t^5 = 0$.

Equations (2–4) correspond to the question network. Let us now turn to defining $u$, the update function for $y_t$ mentioned earlier and which corresponds to the answer network. In general $u$ is an arbitrary function approximator, but for concreteness we define it to be of a linear form

    $y_t = \sigma(W_t x_t),$   (5)

where $x_t \in \mathbb{R}^m$ is a feature vector, $W_t$ is an $n \times m$ matrix, and $\sigma$ is the $n$-vector form of either the identity function (Experiments 1 and 2) or the S-shaped logistic function $\sigma(s) = \frac{1}{1+e^{-s}}$ (Experiment 3). The feature vector is an arbitrary function of the preceding action, observation, and node values:

    $x_t = x(a_{t-1}, o_t, y_{t-1}) \in \mathbb{R}^m.$   (6)

For example, $x_t$ might have one component for each observation bit, one for each possible action (one of which is 1, the rest 0), and $n$ more for the previous node values $y_{t-1}$. The learning algorithm for each component $w_t^{ij}$ of $W_t$ is

    $w_{t+1}^{ij} - w_t^{ij} = \alpha\, (z_t^i - y_t^i)\, c_t^i\, \frac{\partial y_t^i}{\partial w_t^{ij}},$   (7)

where $\alpha$ is a step-size parameter. The timing details may be clarified by writing the sequence of quantities in the order in which they are computed:

    $y_t,\ a_t,\ c_t,\ o_{t+1},\ x_{t+1},\ \tilde{y}_{t+1},\ z_t,\ W_{t+1},\ y_{t+1}.$   (8)

Finally, the target in the extensive sense for $y_t$ is

    $y_t^* = E_{t,\pi}\big[(1 - c_t) \cdot y_{t+1}^* + c_t \cdot z(o_{t+1}, y_{t+1}^*)\big],$   (9)

where $\cdot$ represents component-wise multiplication and $\pi$ is the policy being followed, which is assumed fixed.

¹ In general, $z$ is a function of all the future predictions and observations, but in this paper we treat only the one-step case.

3 Experiment 1: n-step Unconditional Prediction

In this experiment we sought to predict the observation bit precisely $n$ steps in advance, for $n$ = 1, 2, 5, 10, and 25. In order to predict $n$ steps in advance, of course, we also have to predict $n-1$ steps in advance, $n-2$ steps in advance, etc., all the way down to predicting one step ahead. This is specified by a TD network consisting of a single chain of predictions like the left column of Figure 1a, but of length 25 rather than 3.

Random-walk sequences were constructed by starting at the center state and then taking random actions for 50, 100, 150, and 200 steps (100 sequences each). We applied a TD network and a corresponding Monte Carlo method to this data. The Monte Carlo method learned the same predictions, but learned them by comparing them to the actual outcomes in the sequence (instead of $z_t^i$ in (7)). This involved significant additional complexity to store the predictions until their corresponding targets were available. Both algorithms used feature vectors of 7 binary components, one for each of the seven states, all of which were zero except for the one corresponding to the current state. Both algorithms formed their predictions linearly ($\sigma(\cdot)$ was the identity) and unconditionally ($c_t^i = 1\ \forall i, t$).

In an initial set of experiments, both algorithms were applied online with a variety of values for their step-size parameter $\alpha$. Under these conditions we did not find that either algorithm was clearly better in terms of the mean square error in their predictions over the data sets.
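For reference, the per-step online update prescribed by equations (5)–(8) can be sketched as follows (our own rendering; the function names, argument conventions, and the logistic-gradient line are ours, not the paper's):

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def td_network_step(W, y_prev, a_prev, o, a, o_next,
                        x_fn, z_fn, c_fn, alpha=0.1, logistic=False):
        # Features (eq. 6) and predictions (eq. 5) at time t.
        x_t = x_fn(a_prev, o, y_prev)
        y_t = sigmoid(W @ x_t) if logistic else W @ x_t
        # Pre-update next predictions y~_{t+1} (eq. 3), using the old weights.
        x_next = x_fn(a, o_next, y_t)
        y_tilde = sigmoid(W @ x_next) if logistic else W @ x_next
        # Question network: targets (eq. 2) and conditions (eq. 4).
        z_t = z_fn(o_next, y_tilde)
        c_t = c_fn(a, y_t)
        # Gradient dy^i/dw^{ij} = sigma'(s^i) x^j; for the identity, sigma' = 1.
        slope = y_t * (1.0 - y_t) if logistic else np.ones_like(y_t)
        # Learning rule (eq. 7), applied to the whole weight matrix at once.
        W = W + alpha * ((z_t - y_t) * c_t * slope)[:, None] * x_t[None, :]
        return W, y_t

For Experiment 1 this step is run with the identity $\sigma$, $c_t^i = 1$ for all $i$ and $t$, the seven tabular state features, and the chain targets $z_t^1 = o_{t+1}^1$ and $z_t^i = y_{t+1}^{i-1}$ for $i = 2, \ldots, 25$.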
We found a clearer result when both algorithms were trained using batch updating, in which weight changes are collected “on the side” over an experience sequence and then made all at once at the end, and the whole process is repeated until convergence. Under batch updating, convergence is to the same predictions regardless of initial conditions or $\alpha$ value (as long as $\alpha$ is sufficiently small), which greatly simplifies comparison of algorithms. The predictions learned under batch updating are also the same as would be computed by least squares algorithms such as LSTD($\lambda$) (Bradtke & Barto, 1996; Boyan, 2000; Lagoudakis & Parr, 2003). The errors in the final predictions are shown in Table 1. For 1-step predictions, the Monte-Carlo and TD methods performed identically of course, but for longer predictions a significant difference was observed. The RMSE of the Monte Carlo method increased with prediction length whereas for the TD network it decreased. The largest standard error in any of the numbers shown in the table is 0.008, so almost all of the differences are statistically significant. TD methods appear to have a significant data-efficiency advantage over non-TD methods in this prediction-by-n context (and this task) just as they do in conventional multi-step prediction (Sutton, 1988).

    Time     1-step    2-step            5-step            10-step           25-step
    Steps    MC/TD     MC       TD       MC       TD       MC       TD       MC       TD
     50      0.205     0.219    0.172    0.234    0.159    0.249    0.139    0.297    0.129
    100      0.124     0.133    0.100    0.160    0.098    0.168    0.079    0.187    0.068
    150      0.089     0.103    0.073    0.121    0.076    0.130    0.063    0.153    0.054
    200      0.076     0.084    0.060    0.109    0.065    0.112    0.056    0.118    0.049

Table 1: RMSE of Monte-Carlo and TD-network predictions of various lengths and for increasing amounts of training data on the random-walk example with batch updating.

4 Experiment 2: Action-conditional Prediction

The advantage of TD methods should be greater for predictions that apply only when the experience sequence unfolds in a particular way, such as when a particular sequence of actions is taken. In a second experiment we sought to learn n-step-ahead predictions conditional on action selections. The question network for learning all 2-step-ahead predictions is shown in Figure 1b. The upper two nodes predict the observation bit conditional on taking a left action (L) or a right action (R). The lower four nodes correspond to the two-step predictions, e.g., the second lower node is the prediction of what the observation bit will be if an L action is taken followed by an R action. These predictions are the same as the e-tests used in some of the work on predictive state representations (Littman, Sutton & Singh, 2002; Rudary & Singh, 2004).

In this experiment we used a question network like that in Figure 1b except of depth four, consisting of 30 (2+4+8+16) nodes. The conditions for each node were set to 0 or 1 depending on whether the action taken on the step matched that indicated in the figure. The feature vectors were as in the previous experiment. Now that we are conditioning on action, the problem is deterministic and $\alpha$ can be set uniformly to 1. A Monte Carlo prediction can be learned only when its corresponding action sequence occurs in its entirety, but then it is complete and accurate in one step. The TD network, on the other hand, can learn from incomplete sequences but must propagate them back one level at a time. First the one-step predictions must be learned, then the two-step predictions from them, and so on.
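The depth-d action-conditional tree and its condition function can be written down directly; the following sketch is our own construction from the description above (the node ordering and names are ours):

    from itertools import product
    import numpy as np

    ACTIONS = ('L', 'R')

    def build_tree(depth):
        # One node per non-empty action sequence up to the given depth,
        # e.g. depth 4 gives 2 + 4 + 8 + 16 = 30 nodes.
        nodes = []
        for d in range(1, depth + 1):
            nodes.extend(product(ACTIONS, repeat=d))
        return nodes

    def conditions(nodes, a_t):
        # c_t^i = 1 iff the action taken matches the first action of node i's sequence.
        return np.array([1.0 if seq[0] == a_t else 0.0 for seq in nodes])

    def targets(nodes, o_bit_next, y_next):
        # z_t^i is the special bit itself for depth-1 nodes; for deeper nodes it is
        # the next-step prediction of the node asking about the remaining actions.
        index = {seq: i for i, seq in enumerate(nodes)}
        z = np.empty(len(nodes))
        for i, seq in enumerate(nodes):
            z[i] = o_bit_next if len(seq) == 1 else y_next[index[seq[1:]]]
        return z

Under this reading, a node's target is defined through the node one action shorter, which is exactly the one-level-at-a-time propagation just described.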
The results for online and batch training are shown in Tables 2 and 3. As anticipated, the TD network learns much faster than Monte Carlo with both online and batch updating. Because the TD network learns its n-step predictions based on its (n−1)-step predictions, it has a clear advantage for this task. Once the TD network has seen each action in each state, it can quickly learn any prediction 2, 10, or 1000 steps in the future. Monte Carlo, on the other hand, must sample actual sequences, so each exact action sequence must be observed.

    Time     1-Step    2-Step            3-Step            4-Step
    Step     MC/TD     MC       TD       MC       TD       MC       TD
    100      0.153     0.222    0.182    0.253    0.195    0.285    0.185
    200      0.019     0.092    0.044    0.142    0.054    0.196    0.062
    300      0.000     0.040    0.000    0.089    0.013    0.139    0.017
    400      0.000     0.019    0.000    0.055    0.000    0.093    0.000
    500      0.000     0.019    0.000    0.038    0.000    0.062    0.000

Table 2: RMSE of the action-conditional predictions of various lengths for Monte-Carlo and TD-network methods on the random-walk problem with online updating.

    Time Steps    50         100        150        200
    MC            53.48%     30.81%     19.26%     11.69%
    TD            17.21%     4.50%      1.57%      0.14%

Table 3: Average proportion of incorrect action-conditional predictions for batch-updating versions of Monte-Carlo and TD-network methods, for various amounts of data, on the random-walk task. All differences are statistically significant.

5 Experiment 3: Learning a Predictive State Representation

Experiments 1 and 2 showed advantages for TD learning methods in Markov problems. The feature vectors in both experiments provided complete information about the nominal state of the random walk. In Experiment 3, on the other hand, we applied TD networks to a non-Markov version of the random-walk example, in particular, one in which only the special observation bit was visible and not the state number. In this case it is not possible to make accurate predictions based solely on the current action and observation; the previous time step’s predictions must be used as well.

As in the previous experiment, we sought to learn n-step predictions using action-conditional question networks of depths 2, 3, and 4. The feature vector $x_t$ consisted of three parts: a constant 1, four binary features to represent the pair of action $a_{t-1}$ and observation bit $o_t$, and $n$ more features corresponding to the components of $y_{t-1}$. The feature vectors were thus of length $m$ = 11, 19, and 35 for the three depths. In this experiment, $\sigma(\cdot)$ was the S-shaped logistic function. The initial weights $W_0$ and predictions $y_0$ were both 0.

Fifty random-walk sequences were constructed, each of 250,000 time steps, and presented to TD networks of the three depths, with a range of step-size parameters $\alpha$. We measured the RMSE of all predictions made by the networks (computed from knowledge of the task) and also the “empirical RMSE,” the error in the one-step prediction for the action actually taken on each step. We found that in all cases the errors approached zero over time, showing that the problem was completely solved. Figure 2 shows some representative learning curves for the depth-2 and depth-4 TD networks.

[Figure 2 appears here: curves of empirical RMS error (0 to .3) versus time steps (0 to 250K) for step-size parameters α = .1, .25, .5, and .75, plus one depth-2 curve.]

Figure 2: Prediction performance on the non-Markov random walk with depth-4 TD networks (and one depth-2 network) with various step-size parameters, averaged over 50 runs and 1000 time-step bins. The “bump” most clearly seen with small step sizes is reliably present and may be due to predictions of different lengths being learned at different times.
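To be concrete about the inputs used in this experiment, the feature vector described above might be assembled as follows (a sketch; the ordering of the four action-observation indicator bits is our assumption, since the paper does not specify it):

    import numpy as np

    def features_non_markov(a_prev, o_bit, y_prev, actions=('L', 'R')):
        # A constant 1, four indicator bits for the (previous action, current
        # observation bit) pair, and the previous predictions y_{t-1}.
        pair = np.zeros(4)
        pair[actions.index(a_prev) * 2 + int(o_bit)] = 1.0
        return np.concatenate(([1.0], pair, np.asarray(y_prev)))

With the depth-2, -3, and -4 trees (n = 6, 14, and 30 nodes), this gives vectors of length 1 + 4 + n = 11, 19, and 35, matching the figures quoted above.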
In ongoing experiments on other non-Markov problems we have found that TD networks do not always find such complete solutions. Other problems seem to require more than one step of history information (the one-step-preceding action and observation), though less than would be required using history information alone. Our results as a whole suggest that TD networks may provide an effective alternative learning algorithm for predictive state representations (Littman et al., 2002). Previous algorithms have been found to be effective on some tasks but not on others (e.g., Singh et al., 2003; Rudary & Singh, 2004; James & Singh, 2004). More work is needed to assess the range of effectiveness and learning rate of TD methods vis-a-vis previous methods, and to explore their combination with history information.

6 Conclusion

TD networks suggest a large set of possibilities for learning to predict, and in this paper we have begun exploring the first few. Our results show that even in a fully observable setting there may be significant advantages to TD methods when learning TD-defined predictions. Our action-conditional results show that TD methods can learn dramatically faster than other methods. TD networks allow the expression of many new kinds of predictions whose extensive semantics is not immediately clear, but which are ultimately fully grounded in data. It may be fruitful to further explore the expressive potential of TD-defined predictions.

Although most of our experiments have concerned the representational expressiveness and efficiency of TD-defined predictions, it is also natural to consider using them as state, as in predictive state representations. Our experiments suggest that this is a promising direction and that TD learning algorithms may have advantages over previous learning methods. Finally, we note that adding nodes to a question network produces new predictions and thus may be a way to address the discovery problem for predictive representations.

Acknowledgments

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Satinder Singh, Doina Precup, Michael Littman, Mark Ring, Vadim Bulitko, Eddie Rafols, Anna Koop, Tao Wang, and all the members of the rlai.net group.

References

Boyan, J. A. (2000). Technical update: Least-squares temporal difference learning. Machine Learning 49:233–246.

Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1/2/3):33–57.

Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation 5(4):613–624.

James, M. and Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 417–424.

Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pp. 167–173.

Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research 4(Dec):1107–1149.

Littman, M. L., Sutton, R. S. and Singh, S. (2002). Predictive representations of state. In Advances in Neural Information Processing Systems 14, pp. 1555–1561.

Rudary, M. R. and Singh, S. (2004). A nonlinear predictive state representation. In Advances in Neural Information Processing Systems 16, pp. 855–862.

Singh, S., Littman, M. L., Jong, N. K., Pardoe, D. and Stone, P. (2003). Learning predictive state representations. In Proceedings of the Twentieth International Conference on Machine Learning, pp. 712–719.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9–44.

Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 531–539. Morgan Kaufmann, San Francisco.

Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181–211.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. [sent-5, score-0.853]

2 TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. [sent-6, score-0.357]

3 Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. [sent-7, score-0.194]

4 Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. [sent-9, score-0.249]

5 Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. [sent-11, score-0.139]

6 Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). [sent-12, score-0.402]

7 TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. [sent-15, score-0.56]

8 In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. [sent-17, score-0.171]

9 In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. [sent-19, score-0.367]

10 To be more precise, we must make a distinction between the extensive definition of a prediction, expressing its desired relationship to measurable data, and its TD definition, expressing its desired relationship to other predictions. [sent-21, score-0.169]

11 In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future rewards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). [sent-22, score-0.206]

12 In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. [sent-24, score-0.366]

13 In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics. [sent-25, score-0.62]

14 At each of a series of discrete time steps t, the environment generates an observation o t ∈ O, and the agent takes an action at ∈ A. [sent-27, score-0.322]

15 The action and observation events occur in sequence, o1 , a1 , o2 , a2 , o3 · · ·, with each event of course dependent only on those preceding it. [sent-29, score-0.256]

16 For the seven-state random walk, we will predict the special observation bit some numbers of discrete steps in advance, first unconditionally and then conditioned on action sequences. [sent-39, score-0.514]

17 2 TD Networks A TD network is a network of nodes, each representing a single scalar prediction. [sent-40, score-0.174]

18 The nodes are interconnected by links representing the TD relationships among the predictions and to the observations and actions. [sent-41, score-0.409]

19 These links determine the extensive semantics of each prediction—its desired or target relationship to the data. [sent-42, score-0.201]

20 They represent what we seek to predict about the data as opposed to how we try to predict it. [sent-43, score-0.146]

21 We think of these links as determining a set of questions being asked about the data, and accordingly we call them the question network. [sent-44, score-0.138]

22 A separate set of interconnections determines the actual computational process—the updating of the predictions at each node from their previous values and the current action and observation. [sent-45, score-0.549]

23 The question network provides targets for a learning process shaping the answer network and does not otherwise affect the behavior of the TD network. [sent-47, score-0.324]

24 The node labeled 1 is directly connected to the first observation bit and represents a prediction that that bit will be 1 on the next time step. [sent-51, score-0.563]

25 The node labeled 2 is similarly a prediction of the expected value of node 1 on the next step. [sent-52, score-0.28]

26 Thus the extensive definition of Node 2’s prediction is the probability that the first observation bit will be 1 two time steps from now. [sent-53, score-0.469]

27 Node 3 similarly predicts the first observation bit three time steps in the future. [sent-54, score-0.31]

28 Node 4 is a conventional TD prediction, in this case of the future discounted sum of the second observation bit, with discount parameter γ. [sent-55, score-0.179]

29 Its target is the familiar TD target, the data bit plus the node’s own prediction on the next time step (with weightings 1 − γ and γ respectively). [sent-56, score-0.306]

30 Nodes 5 and 6 predict the probability of the third observation bit being 1 if particular actions a or b are taken respectively. [sent-57, score-0.315]

31 Node 7 is a prediction of the average of the first observation bit and Node 4’s prediction, both on the next step. [sent-58, score-0.316]

32 This is the first case where it is not easy to see or state the extensive semantics of the prediction in terms of the data. [sent-59, score-0.283]

33 Node 8 predicts another average, this time of nodes 4 and 5, and the question it asks is even harder to express extensively. [sent-60, score-0.183]

34 One could continue in this way, adding more and more nodes whose extensive definitions are difficult to express but which would nevertheless be completely defined as long as these local TD relationships are clear. [sent-61, score-0.171]

35 The thinner links shown entering some nodes are meant to be a suggestion of the entirely separate answer network determining the actual computation (as opposed to the goals) of the network. [sent-62, score-0.259]

36 In this paper we consider only simple question networks such as the left column of Figure 1a and of the action-conditional tree form shown in Figure 1b. [sent-63, score-0.142]

37 1−γ 1 4 γ a 5 b L 6 L 2 7 R L R R 8 3 (a) (b) Figure 1: The question networks of two TD networks. [sent-64, score-0.142]

38 (a) a question network discussed in the text, and (b) a depth-2 fully-action-conditional question network used in Experiments 2 and 3. [sent-65, score-0.322]

39 The thick lines represent the question network and the thin lines in (a) suggest the answer network (the bulk of which is not shown). [sent-67, score-0.325]

40 i More formally and generally, let yt ∈ [0, 1], i = 1, . [sent-69, score-0.2]

41 , n, denote the prediction of the 1 n ith node at time step t. [sent-72, score-0.233]

42 , yt )T is updated according to a vector-valued function u with modifiable parameter W: yt = u(yt−1 , at−1 , ot , Wt ) ∈ n . [sent-76, score-0.511]

43 The TD target zt for yt is an arbitrary function z i of the successive predictions and observations. [sent-79, score-0.648]

44 In vector form we have 1 zt = z(ot+1 , ˜t+1 ) ∈ n , y (2) where ˜t+1 is just like yt+1 , as in (1), except calculated with the old weights before they y are updated on the basis of zt : ˜t = u(yt−1 , at−1 , ot , Wt−1 ) ∈ n . [sent-80, score-0.395]

45 ) For example, for the 1 2 1 3 2 4 4 nodes in Figure 1a we have zt = o1 , zt = yt+1 , zt = yt+1 , zt = (1 − γ)o2 + γyt+1 , t+1 t+1 1 1 1 4 1 4 1 5 5 6 3 7 8 zt = zt = ot+1 , zt = 2 ot+1 + 2 yt+1 , and zt = 2 yt+1 + 2 yt+1 . [sent-82, score-1.194]

46 The other part has to do with making them potentially conditional on action and observation. [sent-84, score-0.169]

47 For example, Node 5 in Figure 1a predicts what the third observation bit will be if action a is taken. [sent-85, score-0.384]

48 To arrange for such i semantics we introduce a new vector ct of conditions, ci , indicating the extent to which yt t i is held responsible for matching zt , thus making the ith prediction conditional on ci . [sent-86, score-0.615]

49 Each t ci is determined as an arbitrary function ci of at and yt . [sent-87, score-0.289]

50 In vector form we have: t ct = c(at , yt ) ∈ [0, 1]n . [sent-88, score-0.242]

51 Let us now turn to defining u, the update function for yt mentioned earlier and which corresponds to the answer network. [sent-91, score-0.25]

52 The feature vector is an arbitrary function of the preceding action, observation, and node values: xt = x(at−1 , ot , yt−1 ) ∈ m . [sent-93, score-0.3]

53 (6) For example, xt might have one component for each observation bit, one for each possible action (one of which is 1, the rest 0), and n more for the previous node values y t−1 . [sent-94, score-0.354]

54 The ij learning algorithm for each component wt of Wt is ij ij i i wt+1 − wt = α(zt − yt )ci t i ∂yt , (7) ij ∂wt where α is a step-size parameter. [sent-95, score-0.473]

55 The timing details may be clarified by writing the sequence of quantities in the order in which they are computed: yt at ct ot+1 xt+1 ˜t+1 zt Wt+1 yt+1 . [sent-96, score-0.414]

56 y (8) Finally, the target in the extensive sense for yt is (9) y∗ = Et,π (1 − ct ) · y∗ + ct · z(ot+1 , y∗ ) , t t+1 t where · represents component-wise multiplication and π is the policy being followed, which is assumed fixed. [sent-97, score-0.387]

57 1 In general, z is a function of all the future predictions and observations, but in this paper we treat only the one-step case. [sent-98, score-0.287]

58 3 Experiment 1: n-step Unconditional Prediction In this experiment we sought to predict the observation bit precisely n steps in advance, for n = 1, 2, 5, 10, and 25. [sent-99, score-0.4]

59 In order to predict n steps in advance, of course, we also have to predict n − 1 steps in advance, n − 2 steps in advance, etc. [sent-100, score-0.29]

60 This is specified by a TD network consisting of a single chain of predictions like the left column of Figure 1a, but of length 25 rather than 3. [sent-102, score-0.341]

61 Random-walk sequences were constructed by starting at the center state and then taking random actions for 50, 100, 150, and 200 steps (100 sequences each). [sent-103, score-0.207]

62 The Monte Carlo method learned the same predictions, but learned them by comparing them to the i actual outcomes in the sequence (instead of zt in (7)). [sent-105, score-0.235]

63 This involved significant additional complexity to store the predictions until their corresponding targets were available. [sent-106, score-0.281]

64 Both algorithms formed their predictions linearly (σ(·) was the identity) and unconditionally (c i = 1 ∀i, t). [sent-108, score-0.295]

65 Under these conditions we did not find that either algorithm was clearly better in terms of the mean square error in their predictions over the data sets. [sent-110, score-0.263]

66 Under batch updating, convergence is to the same predictions regardless of initial conditions or α value (as long as α is sufficiently small), which greatly simplifies comparison of algorithms. [sent-112, score-0.32]

67 The predictions learned under batch updating are also the same as would be computed by least squares algorithms such as LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2000; Lagoudakis & Parr, 2003). [sent-113, score-0.389]

68 The errors in the final predictions are shown in Table 1. [sent-114, score-0.263]

69 For 1-step predictions, the Monte-Carlo and TD methods performed identically of course, but for longer predictions a significant difference was observed. [sent-115, score-0.263]

70 The RMSE of the Monte Carlo method increased with prediction length whereas for the TD network it decreased. [sent-116, score-0.18]

71 TD methods appear to have a significant data-efficiency advantage over non-TD methods in this prediction-by-n context (and this task) just as they do in conventional multi-step prediction (Sutton, 1988). [sent-119, score-0.147]

72 049 Table 1: RMSE of Monte-Carlo and TD-network predictions of various lengths and for increasing amounts of training data on the random-walk example with batch updating. [sent-156, score-0.351]

73 4 Experiment 2: Action-conditional Prediction The advantage of TD methods should be greater for predictions that apply only when the experience sequence unfolds in a particular way, such as when a particular sequence of actions are made. [sent-157, score-0.387]

74 In a second experiment we sought to learn n-step-ahead predictions conditional on action selections. [sent-158, score-0.526]

75 The question network for learning all 2-step-ahead pre- dictions is shown in Figure 1b. [sent-159, score-0.178]

76 The upper two nodes predict the observation bit conditional on taking a left action (L) or a right action (R). [sent-160, score-0.655]

77 , the second lower node is the prediction of what the observation bit will be if an L action is taken followed by an R action. [sent-163, score-0.555]

78 These predictions are the same as the e-tests used in some of the work on predictive state representations (Littman, Sutton & Singh, 2002; Rudary & Singh, 2003). [sent-164, score-0.427]

79 In this experiment we used a question network like that in Figure 1b except of depth four, consisting of 30 (2+4+8+16) nodes. [sent-165, score-0.22]

80 The conditions for each node were set to 0 or 1 depending on whether the action taken on the step matched that indicated in the figure. [sent-166, score-0.259]

81 A Monte Carlo prediction can be learned only when its corresponding action sequence occurs in its entirety, but then it is complete and accurate in one step. [sent-169, score-0.303]

82 First the one-step predictions must be learned, then the two-step predictions from them, and so on. [sent-171, score-0.526]

83 As anticipated, the TD network learns much faster than Monte Carlo with both online and batch updating. [sent-173, score-0.16]

84 Because the TD network learns its n step predictions based on its n − 1 step predictions, it has a clear advantage for this task. [sent-174, score-0.381]

85 Once the TD Network has seen each action in each state, it can quickly learn any prediction 2, 10, or 1000 steps in the future. [sent-175, score-0.332]

86 Monte Carlo, on the other hand, must sample actual sequences, so each exact action sequence must be observed. [sent-176, score-0.201]

87 000 Table 2: RMSE of the action-conditional predictions of various lengths for Monte-Carlo and TD-network methods on the random-walk problem with online updating. [sent-212, score-0.319]

88 14% Table 3: Average proportion of incorrect action-conditional predictions for batch-updating versions of Monte-Carlo and TD-network methods, for various amounts of data, on the random-walk task. [sent-221, score-0.263]

89 In Experiment 3, on the other hand, we applied TD networks to a non-Markov version of the random-walk example, in particular, in which only the special observation bit was visible and not the state number. [sent-225, score-0.339]

90 In this case it is not possible to make accurate predictions based solely on the current action and observation; the previous time step’s predictions must be used as well. [sent-226, score-0.698]

91 As in the previous experiment, we sought to learn n-step predictions using actionconditional question networks of depths 2, 3, and 4. [sent-227, score-0.496]

92 The feature vector xt consisted of three parts: a constant 1, four binary features to represent the pair of action a t−1 and observation bit ot , and n more features corresponding to the components of y t−1 . [sent-228, score-0.548]

93 The initial weights W0 and predictions y0 were both 0. [sent-231, score-0.263]

94 We measured the RMSE of all predictions made by the networks (computed from knowledge of the task) and also the “empirical RMSE,” the error in the one-step prediction for the action actually taken on each step. [sent-233, score-0.574]

95 The “bump” most clearly seen with small step sizes is reliably present and may be due to predictions of different lengths being learned at different times. [sent-245, score-0.335]

96 Other problems seem to require more than one step of history information (the one-step-preceding action and observation), though less than would be required using history information alone. [sent-247, score-0.214]

97 Our results as a whole suggest that TD networks may provide an effective alternative learning algorithm for predictive state representations (Littman et al. [sent-248, score-0.258]

98 TD networks allow the expression of many new kinds of predictions whose extensive semantics is not immediately clear, but which are ultimately fully grounded in data. [sent-257, score-0.463]

99 Finally, we note that adding nodes to a question network produces new predictions and thus may be a way to address the discovery problem for predictive representations. [sent-261, score-0.552]

100 Learning and discovery of predictive state representations in dynamical systems with reset. [sent-283, score-0.164]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('td', 0.769), ('predictions', 0.263), ('yt', 0.2), ('action', 0.15), ('zt', 0.142), ('bit', 0.136), ('singh', 0.125), ('ot', 0.111), ('prediction', 0.102), ('mc', 0.097), ('sutton', 0.097), ('node', 0.089), ('wt', 0.086), ('question', 0.083), ('rmse', 0.079), ('network', 0.078), ('observation', 0.078), ('extensive', 0.077), ('predictive', 0.07), ('state', 0.066), ('predict', 0.064), ('networks', 0.059), ('nodes', 0.058), ('carlo', 0.057), ('monte', 0.057), ('batch', 0.057), ('littman', 0.056), ('rudary', 0.055), ('steps', 0.054), ('answer', 0.05), ('advance', 0.049), ('conventional', 0.045), ('ct', 0.042), ('experiment', 0.04), ('temporal', 0.039), ('precup', 0.038), ('semantics', 0.038), ('actions', 0.037), ('depths', 0.037), ('lagoudakis', 0.037), ('xt', 0.037), ('ci', 0.036), ('relationships', 0.036), ('guess', 0.035), ('links', 0.034), ('discounted', 0.032), ('boyan', 0.032), ('unconditionally', 0.032), ('bradtke', 0.032), ('lengths', 0.031), ('sequence', 0.03), ('walk', 0.029), ('seven', 0.029), ('parr', 0.029), ('successor', 0.029), ('reinforcement', 0.029), ('preceding', 0.028), ('representations', 0.028), ('sought', 0.028), ('alberta', 0.027), ('experience', 0.027), ('reward', 0.026), ('target', 0.026), ('relationship', 0.026), ('learn', 0.026), ('updating', 0.026), ('grounded', 0.026), ('kaelbling', 0.026), ('sequences', 0.025), ('online', 0.025), ('future', 0.024), ('james', 0.023), ('barto', 0.023), ('time', 0.022), ('squares', 0.022), ('history', 0.022), ('ij', 0.021), ('learned', 0.021), ('actual', 0.021), ('accordingly', 0.021), ('step', 0.02), ('predicts', 0.02), ('expressing', 0.02), ('ciency', 0.019), ('possibilities', 0.019), ('depth', 0.019), ('bits', 0.019), ('advantages', 0.019), ('conditional', 0.019), ('explore', 0.019), ('targets', 0.018), ('dayan', 0.018), ('representing', 0.018), ('represent', 0.018), ('agent', 0.018), ('suggest', 0.018), ('feature', 0.018), ('entirely', 0.018), ('arbitrary', 0.017), ('learning', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 183 nips-2004-Temporal-Difference Networks

Author: Richard S. Sutton, Brian Tanner

Abstract: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the interpredictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. The essential idea of TD learning can be described as “learning a guess from a guess”. In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the first. To be more precise, we must make a distinction between the extensive definition of a prediction, expressing its desired relationship to measurable data, and its TD definition, expressing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future rewards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It’s the same prediction, just defined or expressed in different ways. In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics. 
1 The Learning-to-predict Problem The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation o t ∈ O, and the agent takes an action at ∈ A. Whereas A is an arbitrary discrete set, we assume without loss of generality that ot can be represented as a vector of bits. The action and observation events occur in sequence, o1 , a1 , o2 , a2 , o3 · · ·, with each event of course dependent only on those preceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section. In this paper we use a random-walk problem with seven states, with left and right actions available in every state: 1 1 0 2 0 3 0 4 0 5 0 6 1 7 The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the first two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends deterministically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain. The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some numbers of discrete steps in advance, first unconditionally and then conditioned on action sequences. 2 TD Networks A TD network is a network of nodes, each representing a single scalar prediction. The nodes are interconnected by links representing the TD relationships among the predictions and to the observations and actions. These links determine the extensive semantics of each prediction—its desired or target relationship to the data. They represent what we seek to predict about the data as opposed to how we try to predict it. We think of these links as determining a set of questions being asked about the data, and accordingly we call them the question network. A separate set of interconnections determines the actual computational process—the updating of the predictions at each node from their previous values and the current action and observation. We think of this process as providing the answers to the questions, and accordingly we call them the answer network. The question network provides targets for a learning process shaping the answer network and does not otherwise affect the behavior of the TD network. It is natural to consider changing the question network, but in this paper we take it as fixed and given. Figure 1a shows a suggestive example of a question network. The three squares across the top represent three observation bits. The node labeled 1 is directly connected to the first observation bit and represents a prediction that that bit will be 1 on the next time step. The node labeled 2 is similarly a prediction of the expected value of node 1 on the next step. 
Thus the extensive definition of Node 2’s prediction is the probability that the first observation bit will be 1 two time steps from now. Node 3 similarly predicts the first observation bit three time steps in the future. Node 4 is a conventional TD prediction, in this case of the future discounted sum of the second observation bit, with discount parameter γ. Its target is the familiar TD target, the data bit plus the node’s own prediction on the next time step (with weightings 1 − γ and γ respectively). Nodes 5 and 6 predict the probability of the third observation bit being 1 if particular actions a or b are taken respectively. Node 7 is a prediction of the average of the first observation bit and Node 4’s prediction, both on the next step. This is the first case where it is not easy to see or state the extensive semantics of the prediction in terms of the data. Node 8 predicts another average, this time of nodes 4 and 5, and the question it asks is even harder to express extensively. One could continue in this way, adding more and more nodes whose extensive definitions are difficult to express but which would nevertheless be completely defined as long as these local TD relationships are clear. The thinner links shown entering some nodes are meant to be a suggestion of the entirely separate answer network determining the actual computation (as opposed to the goals) of the network. In this paper we consider only simple question networks such as the left column of Figure 1a and of the action-conditional tree form shown in Figure 1b. 1−γ 1 4 γ a 5 b L 6 L 2 7 R L R R 8 3 (a) (b) Figure 1: The question networks of two TD networks. (a) a question network discussed in the text, and (b) a depth-2 fully-action-conditional question network used in Experiments 2 and 3. Observation bits are represented as squares across the top while actual nodes of the TD network, corresponding each to a separate prediction, are below. The thick lines represent the question network and the thin lines in (a) suggest the answer network (the bulk of which is not shown). Note that all of these nodes, arrows, and numbers are completely different and separate from those representing the random-walk problem on the preceding page. i More formally and generally, let yt ∈ [0, 1], i = 1, . . . , n, denote the prediction of the 1 n ith node at time step t. The column vector of predictions yt = (yt , . . . , yt )T is updated according to a vector-valued function u with modifiable parameter W: yt = u(yt−1 , at−1 , ot , Wt ) ∈ n . (1) The update function u corresponds to the answer network, with W being the weights on its links. Before detailing that process, we turn to the question network, the defining TD i i relationships between nodes. The TD target zt for yt is an arbitrary function z i of the successive predictions and observations. In vector form we have 1 zt = z(ot+1 , ˜t+1 ) ∈ n , y (2) where ˜t+1 is just like yt+1 , as in (1), except calculated with the old weights before they y are updated on the basis of zt : ˜t = u(yt−1 , at−1 , ot , Wt−1 ) ∈ n . y (3) (This temporal subtlety also arises in conventional TD learning.) For example, for the 1 2 1 3 2 4 4 nodes in Figure 1a we have zt = o1 , zt = yt+1 , zt = yt+1 , zt = (1 − γ)o2 + γyt+1 , t+1 t+1 1 1 1 4 1 4 1 5 5 6 3 7 8 zt = zt = ot+1 , zt = 2 ot+1 + 2 yt+1 , and zt = 2 yt+1 + 2 yt+1 . The target functions z i are only part of specifying the question network. The other part has to do with making them potentially conditional on action and observation. 
For example, Node 5 in Figure 1a predicts what the third observation bit will be if action a is taken. To arrange for such i semantics we introduce a new vector ct of conditions, ci , indicating the extent to which yt t i is held responsible for matching zt , thus making the ith prediction conditional on ci . Each t ci is determined as an arbitrary function ci of at and yt . In vector form we have: t ct = c(at , yt ) ∈ [0, 1]n . (4) For example, for Node 5 in Figure 1a, c5 = 1 if at = a, otherwise c5 = 0. t t Equations (2–4) correspond to the question network. Let us now turn to defining u, the update function for yt mentioned earlier and which corresponds to the answer network. In general u is an arbitrary function approximator, but for concreteness we define it to be of a linear form yt = σ(Wt xt ) (5) m where xt ∈ is a feature vector, Wt is an n × m matrix, and σ is the n-vector form of the identity function (Experiments 1 and 2) or the S-shaped logistic function σ(s) = 1 1+e−s (Experiment 3). The feature vector is an arbitrary function of the preceding action, observation, and node values: xt = x(at−1 , ot , yt−1 ) ∈ m . (6) For example, xt might have one component for each observation bit, one for each possible action (one of which is 1, the rest 0), and n more for the previous node values y t−1 . The ij learning algorithm for each component wt of Wt is ij ij i i wt+1 − wt = α(zt − yt )ci t i ∂yt , (7) ij ∂wt where α is a step-size parameter. The timing details may be clarified by writing the sequence of quantities in the order in which they are computed: yt at ct ot+1 xt+1 ˜t+1 zt Wt+1 yt+1 . y (8) Finally, the target in the extensive sense for yt is (9) y∗ = Et,π (1 − ct ) · y∗ + ct · z(ot+1 , y∗ ) , t t+1 t where · represents component-wise multiplication and π is the policy being followed, which is assumed fixed. 1 In general, z is a function of all the future predictions and observations, but in this paper we treat only the one-step case. 3 Experiment 1: n-step Unconditional Prediction In this experiment we sought to predict the observation bit precisely n steps in advance, for n = 1, 2, 5, 10, and 25. In order to predict n steps in advance, of course, we also have to predict n − 1 steps in advance, n − 2 steps in advance, etc., all the way down to predicting one step ahead. This is specified by a TD network consisting of a single chain of predictions like the left column of Figure 1a, but of length 25 rather than 3. Random-walk sequences were constructed by starting at the center state and then taking random actions for 50, 100, 150, and 200 steps (100 sequences each). We applied a TD network and a corresponding Monte Carlo method to this data. The Monte Carlo method learned the same predictions, but learned them by comparing them to the i actual outcomes in the sequence (instead of zt in (7)). This involved significant additional complexity to store the predictions until their corresponding targets were available. Both algorithms used feature vectors of 7 binary components, one for each of the seven states, all of which were zero except for the one corresponding to the current state. Both algorithms formed their predictions linearly (σ(·) was the identity) and unconditionally (c i = 1 ∀i, t). t In an initial set of experiments, both algorithms were applied online with a variety of values for their step-size parameter α. Under these conditions we did not find that either algorithm was clearly better in terms of the mean square error in their predictions over the data sets. 
We found a clearer result when both algorithms were trained using batch updating, in which weight changes are collected “on the side” over an experience sequence and then made all at once at the end, with the whole process repeated until convergence. Under batch updating, convergence is to the same predictions regardless of initial conditions or the value of α (as long as α is sufficiently small), which greatly simplifies comparison of algorithms. The predictions learned under batch updating are also the same as would be computed by least-squares algorithms such as LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2000; Lagoudakis & Parr, 2003). The errors in the final predictions are shown in Table 1. For 1-step predictions, the Monte Carlo and TD methods performed identically, of course, but for longer predictions a significant difference was observed. The RMSE of the Monte Carlo method increased with prediction length, whereas for the TD network it decreased. The largest standard error in any of the numbers shown in the table is 0.008, so almost all of the differences are statistically significant. TD methods appear to have a significant data-efficiency advantage over non-TD methods in this prediction-by-n context (and this task), just as they do in conventional multi-step prediction (Sutton, 1988).

                1-step    2-step            5-step            10-step           25-step
Time Steps      MC/TD     MC       TD       MC       TD       MC       TD       MC       TD
50              0.205     0.219    0.172    0.234    0.159    0.249    0.139    0.297    0.129
100             0.124     0.133    0.100    0.160    0.098    0.168    0.079    0.187    0.068
150             0.089     0.103    0.073    0.121    0.076    0.130    0.063    0.153    0.054
200             0.076     0.084    0.060    0.109    0.065    0.112    0.056    0.118    0.049

Table 1: RMSE of Monte Carlo and TD-network predictions of various lengths, for increasing amounts of training data, on the random-walk example with batch updating.

4 Experiment 2: Action-conditional Prediction

The advantage of TD methods should be greater for predictions that apply only when the experience sequence unfolds in a particular way, such as when a particular sequence of actions is made. In a second experiment we sought to learn n-step-ahead predictions conditional on action selections. The question network for learning all 2-step-ahead predictions is shown in Figure 1b. The upper two nodes predict the observation bit conditional on taking a left action (L) or a right action (R). The lower four nodes correspond to the two-step predictions; for example, the second lower node is the prediction of what the observation bit will be if an L action is taken followed by an R action. These predictions are the same as the e-tests used in some of the work on predictive state representations (Littman, Sutton & Singh, 2002; Rudary & Singh, 2003).

In this experiment we used a question network like that in Figure 1b except of depth four, consisting of 30 (2+4+8+16) nodes. The conditions for each node were set to 0 or 1 depending on whether the action taken on the step matched that indicated in the figure. The feature vectors were as in the previous experiment. Now that we are conditioning on action, the problem is deterministic and α can be set uniformly to 1. A Monte Carlo prediction can be learned only when its corresponding action sequence occurs in its entirety, but then it is complete and accurate in one step. The TD network, on the other hand, can learn from incomplete sequences but must propagate them back one level at a time. First the one-step predictions must be learned, then the two-step predictions from them, and so on. The results for online and batch training are shown in Tables 2 and 3.
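Before turning to those results, note that the depth-four question network just described can be enumerated mechanically from the action set. The sketch below is our own bookkeeping for such a tree (the dictionary layout and names are illustrative assumptions): each node corresponds to an action sequence, its condition tests the first action of that sequence, and its TD target is either the special observation bit (for one-step nodes) or the node whose sequence is one action shorter, evaluated on the next step.

```python
from itertools import product

def build_question_tree(actions=('L', 'R'), depth=4):
    """Nodes of a fully-action-conditional question network of the given depth.

    With depth=4 this yields the 30 (2+4+8+16) nodes used in Experiment 2;
    with depth=2 it yields the six-node network of Figure 1b.
    """
    seqs = [seq for d in range(1, depth + 1) for seq in product(actions, repeat=d)]
    index = {seq: i for i, seq in enumerate(seqs)}
    nodes = []
    for seq in seqs:
        target = None if len(seq) == 1 else index[seq[1:]]  # None: target is o_{t+1}
        nodes.append({'sequence': seq,
                      'condition_action': seq[0],  # c_t = 1 iff a_t equals this action
                      'target_node': target})      # otherwise z_t = y^{target}_{t+1}
    return nodes
```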
As anticipated, the TD network learns much faster than Monte Carlo with both online and batch updating. Because the TD network learns its n-step predictions based on its (n − 1)-step predictions, it has a clear advantage for this task. Once the TD network has seen each action in each state, it can quickly learn any prediction 2, 10, or 1000 steps in the future. Monte Carlo, on the other hand, must sample actual sequences, so each exact action sequence must be observed.

                1-Step    2-Step            3-Step            4-Step
Time Step       MC/TD     MC       TD       MC       TD       MC       TD
100             0.153     0.222    0.182    0.253    0.195    0.285    0.185
200             0.019     0.092    0.044    0.142    0.054    0.196    0.062
300             0.000     0.040    0.000    0.089    0.013    0.139    0.017
400             0.000     0.019    0.000    0.055    0.000    0.093    0.000
500             0.000     0.019    0.000    0.038    0.000    0.062    0.000

Table 2: RMSE of the action-conditional predictions of various lengths for Monte Carlo and TD-network methods on the random-walk problem with online updating.

Time Steps      50         100        150        200
MC              53.48%     30.81%     19.26%     11.69%
TD              17.21%      4.50%      1.57%      0.14%

Table 3: Average proportion of incorrect action-conditional predictions for batch-updating versions of the Monte Carlo and TD-network methods, for various amounts of data, on the random-walk task. All differences are statistically significant.

5 Experiment 3: Learning a Predictive State Representation

Experiments 1 and 2 showed advantages for TD learning methods in Markov problems. The feature vectors in both experiments provided complete information about the nominal state of the random walk. In Experiment 3, on the other hand, we applied TD networks to a non-Markov version of the random-walk example, in which only the special observation bit was visible and not the state number. In this case it is not possible to make accurate predictions based solely on the current action and observation; the previous time step’s predictions must be used as well.

As in the previous experiment, we sought to learn n-step predictions using action-conditional question networks of depths 2, 3, and 4. The feature vector x_t consisted of three parts: a constant 1, four binary features to represent the pair of action a_{t-1} and observation bit o_t, and n more features corresponding to the components of y_{t-1}. The feature vectors were thus of length m = 11, 19, and 35 for the three depths. In this experiment, σ(·) was the S-shaped logistic function. The initial weights W_0 and predictions y_0 were both 0. Fifty random-walk sequences were constructed, each of 250,000 time steps, and presented to TD networks of the three depths, with a range of step-size parameters α. We measured the RMSE of all predictions made by the networks (computed from knowledge of the task) and also the “empirical RMSE,” the error in the one-step prediction for the action actually taken on each step. We found that in all cases the errors approached zero over time, showing that the problem was completely solved. Figure 2 shows some representative learning curves for the depth-2 and depth-4 TD networks.

Figure 2: Prediction performance on the non-Markov random walk with depth-4 TD networks (and one depth-2 network) with various step-size parameters (α = .1, .25, .5, and .75), averaged over 50 runs and 1000-time-step bins. The vertical axis is empirical RMS error. The “bump” most clearly seen with small step sizes is reliably present and may be due to predictions of different lengths being learned at different times.
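Returning to the setup of Experiment 3, the feature vector x_t described above can be sketched as follows. The particular one-of-four encoding of the (a_{t-1}, o_t) pair is our own choice for illustration; any equivalent binary encoding would give the stated lengths m = 11, 19, and 35 for n = 6, 14, and 30 nodes.

```python
import numpy as np

def experiment3_features(a_prev, o_bit, y_prev, actions=('L', 'R')):
    """Feature vector x_t for the non-Markov random walk (Eq. 6): a constant 1,
    four binary features encoding the (a_{t-1}, o_t) pair, and the previous
    predictions y_{t-1}."""
    pair = np.zeros(2 * len(actions))
    pair[2 * actions.index(a_prev) + int(o_bit)] = 1.0  # exactly one of the four is 1
    return np.concatenate(([1.0], pair, np.asarray(y_prev, dtype=float)))
```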
In ongoing experiments on other non-Markov problems we have found that TD networks do not always find such complete solutions. Other problems seem to require more than one step of history information (the one-step-preceding action and observation), though less than would be required using history information alone. Our results as a whole suggest that TD networks may provide an effective alternative learning algorithm for predictive state representations (Littman et al., 2002). Previous algorithms have been found to be effective on some tasks but not on others (e.g., Singh et al., 2003; Rudary & Singh, 2004; James & Singh, 2004). More work is needed to assess the range of effectiveness and learning rate of TD methods vis-a-vis previous methods, and to explore their combination with history information.

6 Conclusion

TD networks suggest a large set of possibilities for learning to predict, and in this paper we have begun exploring the first few. Our results show that even in a fully observable setting there may be significant advantages to TD methods when learning TD-defined predictions. Our action-conditional results show that TD methods can learn dramatically faster than other methods. TD networks allow the expression of many new kinds of predictions whose extensive semantics is not immediately clear, but which are ultimately fully grounded in data. It may be fruitful to further explore the expressive potential of TD-defined predictions. Although most of our experiments have concerned the representational expressiveness and efficiency of TD-defined predictions, it is also natural to consider using them as state, as in predictive state representations. Our experiments suggest that this is a promising direction and that TD learning algorithms may have advantages over previous learning methods. Finally, we note that adding nodes to a question network produces new predictions and thus may be a way to address the discovery problem for predictive representations.

Acknowledgments

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Satinder Singh, Doina Precup, Michael Littman, Mark Ring, Vadim Bulitko, Eddie Rafols, Anna Koop, Tao Wang, and all the members of the rlai.net group.

References

Boyan, J. A. (2000). Technical update: Least-squares temporal difference learning. Machine Learning 49:233–246.

Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1/2/3):33–57.

Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation 5(4):613–624.

James, M. and Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 417–424.

Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pp. 167–173.

Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research 4(Dec):1107–1149.

Littman, M. L., Sutton, R. S. and Singh, S. (2002). Predictive representations of state. In Advances in Neural Information Processing Systems 14:1555–1561.

Rudary, M. R. and Singh, S. (2004). A nonlinear predictive state representation. In Advances in Neural Information Processing Systems 16:855–862.

Singh, S., Littman, M. L., Jong, N. K., Pardoe, D. and Stone, P. (2003). Learning predictive state representations. In Proceedings of the Twentieth International Conference on Machine Learning, pp. 712–719.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9–44.

Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 531–539. Morgan Kaufmann, San Francisco.

Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181–211.

2 0.14856914 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efficiently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical findings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. 1

3 0.13614516 33 nips-2004-Brain Inspired Reinforcement Learning

Author: Françcois Rivest, Yoshua Bengio, John Kalaska

Abstract: Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. In contrast, the human brain readily and autonomously finds the complex features when provided with sufficient training. Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. The algorithms are compared and evaluated on the Acrobot task. 1

4 0.13375282 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM

Author: Hans P. Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, Vladimir Vapnik

Abstract: We describe an algorithm for support vector machines (SVM) that can be parallelized efficiently and scales to very large problems with hundreds of thousands of training vectors. Instead of analyzing the whole training set in one optimization step, the data are split into subsets and optimized separately with multiple SVMs. The partial results are combined and filtered again in a ‘Cascade’ of SVMs, until the global optimum is reached. The Cascade SVM can be spread over multiple processors with minimal communication overhead and requires far less memory, since the kernel matrices are much smaller than for a regular SVM. Convergence to the global optimum is guaranteed with multiple passes through the Cascade, but already a single pass provides good generalization. A single pass is 5x – 10x faster than a regular SVM for problems of 100,000 vectors when implemented on a single processor. Parallel implementations on a cluster of 16 processors were tested with over 1 million vectors (2-class problems), converging in a day or two, while a regular SVM never converged in over a week. 1

5 0.12720817 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1

6 0.12356246 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

7 0.11506338 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

8 0.10905001 48 nips-2004-Convergence and No-Regret in Multiagent Learning

9 0.086177178 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors

10 0.081168868 64 nips-2004-Experts in a Markov Decision Process

11 0.073341034 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

12 0.070239536 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

13 0.065795735 155 nips-2004-Responding to Modalities with Different Latencies

14 0.061731506 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

15 0.060318399 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

16 0.058288157 88 nips-2004-Intrinsically Motivated Reinforcement Learning

17 0.057009429 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities

18 0.056322381 58 nips-2004-Edge of Chaos Computation in Mixed-Mode VLSI - A Hard Liquid

19 0.055792779 175 nips-2004-Stable adaptive control with online learning

20 0.055153634 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.159), (1, -0.056), (2, 0.242), (3, -0.015), (4, 0.056), (5, -0.066), (6, 0.024), (7, -0.048), (8, 0.017), (9, -0.055), (10, -0.012), (11, 0.045), (12, -0.1), (13, 0.002), (14, 0.108), (15, 0.02), (16, -0.019), (17, 0.011), (18, 0.037), (19, -0.081), (20, -0.121), (21, 0.067), (22, -0.045), (23, -0.016), (24, -0.126), (25, 0.005), (26, -0.081), (27, -0.014), (28, 0.206), (29, -0.037), (30, -0.029), (31, 0.067), (32, 0.065), (33, 0.082), (34, -0.005), (35, -0.215), (36, 0.062), (37, -0.15), (38, -0.062), (39, -0.094), (40, 0.105), (41, -0.069), (42, 0.016), (43, -0.027), (44, -0.034), (45, 0.157), (46, 0.022), (47, -0.003), (48, 0.035), (49, -0.088)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97167355 183 nips-2004-Temporal-Difference Networks

Author: Richard S. Sutton, Brian Tanner

Abstract: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the interpredictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. The essential idea of TD learning can be described as “learning a guess from a guess”. In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the first. To be more precise, we must make a distinction between the extensive definition of a prediction, expressing its desired relationship to measurable data, and its TD definition, expressing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future rewards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It’s the same prediction, just defined or expressed in different ways. In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics. 
1 The Learning-to-predict Problem The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation o t ∈ O, and the agent takes an action at ∈ A. Whereas A is an arbitrary discrete set, we assume without loss of generality that ot can be represented as a vector of bits. The action and observation events occur in sequence, o1 , a1 , o2 , a2 , o3 · · ·, with each event of course dependent only on those preceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section. In this paper we use a random-walk problem with seven states, with left and right actions available in every state: 1 1 0 2 0 3 0 4 0 5 0 6 1 7 The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the first two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends deterministically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain. The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some numbers of discrete steps in advance, first unconditionally and then conditioned on action sequences. 2 TD Networks A TD network is a network of nodes, each representing a single scalar prediction. The nodes are interconnected by links representing the TD relationships among the predictions and to the observations and actions. These links determine the extensive semantics of each prediction—its desired or target relationship to the data. They represent what we seek to predict about the data as opposed to how we try to predict it. We think of these links as determining a set of questions being asked about the data, and accordingly we call them the question network. A separate set of interconnections determines the actual computational process—the updating of the predictions at each node from their previous values and the current action and observation. We think of this process as providing the answers to the questions, and accordingly we call them the answer network. The question network provides targets for a learning process shaping the answer network and does not otherwise affect the behavior of the TD network. It is natural to consider changing the question network, but in this paper we take it as fixed and given. Figure 1a shows a suggestive example of a question network. The three squares across the top represent three observation bits. The node labeled 1 is directly connected to the first observation bit and represents a prediction that that bit will be 1 on the next time step. The node labeled 2 is similarly a prediction of the expected value of node 1 on the next step. 
Thus the extensive definition of Node 2’s prediction is the probability that the first observation bit will be 1 two time steps from now. Node 3 similarly predicts the first observation bit three time steps in the future. Node 4 is a conventional TD prediction, in this case of the future discounted sum of the second observation bit, with discount parameter γ. Its target is the familiar TD target, the data bit plus the node’s own prediction on the next time step (with weightings 1 − γ and γ respectively). Nodes 5 and 6 predict the probability of the third observation bit being 1 if particular actions a or b are taken respectively. Node 7 is a prediction of the average of the first observation bit and Node 4’s prediction, both on the next step. This is the first case where it is not easy to see or state the extensive semantics of the prediction in terms of the data. Node 8 predicts another average, this time of nodes 4 and 5, and the question it asks is even harder to express extensively. One could continue in this way, adding more and more nodes whose extensive definitions are difficult to express but which would nevertheless be completely defined as long as these local TD relationships are clear. The thinner links shown entering some nodes are meant to be a suggestion of the entirely separate answer network determining the actual computation (as opposed to the goals) of the network. In this paper we consider only simple question networks such as the left column of Figure 1a and of the action-conditional tree form shown in Figure 1b. 1−γ 1 4 γ a 5 b L 6 L 2 7 R L R R 8 3 (a) (b) Figure 1: The question networks of two TD networks. (a) a question network discussed in the text, and (b) a depth-2 fully-action-conditional question network used in Experiments 2 and 3. Observation bits are represented as squares across the top while actual nodes of the TD network, corresponding each to a separate prediction, are below. The thick lines represent the question network and the thin lines in (a) suggest the answer network (the bulk of which is not shown). Note that all of these nodes, arrows, and numbers are completely different and separate from those representing the random-walk problem on the preceding page. i More formally and generally, let yt ∈ [0, 1], i = 1, . . . , n, denote the prediction of the 1 n ith node at time step t. The column vector of predictions yt = (yt , . . . , yt )T is updated according to a vector-valued function u with modifiable parameter W: yt = u(yt−1 , at−1 , ot , Wt ) ∈ n . (1) The update function u corresponds to the answer network, with W being the weights on its links. Before detailing that process, we turn to the question network, the defining TD i i relationships between nodes. The TD target zt for yt is an arbitrary function z i of the successive predictions and observations. In vector form we have 1 zt = z(ot+1 , ˜t+1 ) ∈ n , y (2) where ˜t+1 is just like yt+1 , as in (1), except calculated with the old weights before they y are updated on the basis of zt : ˜t = u(yt−1 , at−1 , ot , Wt−1 ) ∈ n . y (3) (This temporal subtlety also arises in conventional TD learning.) For example, for the 1 2 1 3 2 4 4 nodes in Figure 1a we have zt = o1 , zt = yt+1 , zt = yt+1 , zt = (1 − γ)o2 + γyt+1 , t+1 t+1 1 1 1 4 1 4 1 5 5 6 3 7 8 zt = zt = ot+1 , zt = 2 ot+1 + 2 yt+1 , and zt = 2 yt+1 + 2 yt+1 . The target functions z i are only part of specifying the question network. The other part has to do with making them potentially conditional on action and observation. 
For example, Node 5 in Figure 1a predicts what the third observation bit will be if action a is taken. To arrange for such i semantics we introduce a new vector ct of conditions, ci , indicating the extent to which yt t i is held responsible for matching zt , thus making the ith prediction conditional on ci . Each t ci is determined as an arbitrary function ci of at and yt . In vector form we have: t ct = c(at , yt ) ∈ [0, 1]n . (4) For example, for Node 5 in Figure 1a, c5 = 1 if at = a, otherwise c5 = 0. t t Equations (2–4) correspond to the question network. Let us now turn to defining u, the update function for yt mentioned earlier and which corresponds to the answer network. In general u is an arbitrary function approximator, but for concreteness we define it to be of a linear form yt = σ(Wt xt ) (5) m where xt ∈ is a feature vector, Wt is an n × m matrix, and σ is the n-vector form of the identity function (Experiments 1 and 2) or the S-shaped logistic function σ(s) = 1 1+e−s (Experiment 3). The feature vector is an arbitrary function of the preceding action, observation, and node values: xt = x(at−1 , ot , yt−1 ) ∈ m . (6) For example, xt might have one component for each observation bit, one for each possible action (one of which is 1, the rest 0), and n more for the previous node values y t−1 . The ij learning algorithm for each component wt of Wt is ij ij i i wt+1 − wt = α(zt − yt )ci t i ∂yt , (7) ij ∂wt where α is a step-size parameter. The timing details may be clarified by writing the sequence of quantities in the order in which they are computed: yt at ct ot+1 xt+1 ˜t+1 zt Wt+1 yt+1 . y (8) Finally, the target in the extensive sense for yt is (9) y∗ = Et,π (1 − ct ) · y∗ + ct · z(ot+1 , y∗ ) , t t+1 t where · represents component-wise multiplication and π is the policy being followed, which is assumed fixed. 1 In general, z is a function of all the future predictions and observations, but in this paper we treat only the one-step case. 3 Experiment 1: n-step Unconditional Prediction In this experiment we sought to predict the observation bit precisely n steps in advance, for n = 1, 2, 5, 10, and 25. In order to predict n steps in advance, of course, we also have to predict n − 1 steps in advance, n − 2 steps in advance, etc., all the way down to predicting one step ahead. This is specified by a TD network consisting of a single chain of predictions like the left column of Figure 1a, but of length 25 rather than 3. Random-walk sequences were constructed by starting at the center state and then taking random actions for 50, 100, 150, and 200 steps (100 sequences each). We applied a TD network and a corresponding Monte Carlo method to this data. The Monte Carlo method learned the same predictions, but learned them by comparing them to the i actual outcomes in the sequence (instead of zt in (7)). This involved significant additional complexity to store the predictions until their corresponding targets were available. Both algorithms used feature vectors of 7 binary components, one for each of the seven states, all of which were zero except for the one corresponding to the current state. Both algorithms formed their predictions linearly (σ(·) was the identity) and unconditionally (c i = 1 ∀i, t). t In an initial set of experiments, both algorithms were applied online with a variety of values for their step-size parameter α. Under these conditions we did not find that either algorithm was clearly better in terms of the mean square error in their predictions over the data sets. 
We found a clearer result when both algorithms were trained using batch updating, in which weight changes are collected “on the side” over an experience sequence and then made all at once at the end, and the whole process is repeated until convergence. Under batch updating, convergence is to the same predictions regardless of initial conditions or α value (as long as α is sufficiently small), which greatly simplifies comparison of algorithms. The predictions learned under batch updating are also the same as would be computed by least squares algorithms such as LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2000; Lagoudakis & Parr, 2003). The errors in the final predictions are shown in Table 1. For 1-step predictions, the Monte-Carlo and TD methods performed identically of course, but for longer predictions a significant difference was observed. The RMSE of the Monte Carlo method increased with prediction length whereas for the TD network it decreased. The largest standard error in any of the numbers shown in the table is 0.008, so almost all of the differences are statistically significant. TD methods appear to have a significant data-efficiency advantage over non-TD methods in this prediction-by-n context (and this task) just as they do in conventional multi-step prediction (Sutton, 1988). Time Steps 50 100 150 200 1-step MC/TD 0.205 0.124 0.089 0.076 2-step MC TD 0.219 0.172 0.133 0.100 0.103 0.073 0.084 0.060 5-step MC TD 0.234 0.159 0.160 0.098 0.121 0.076 0.109 0.065 10-step MC TD 0.249 0.139 0.168 0.079 0.130 0.063 0.112 0.056 25-step MC TD 0.297 0.129 0.187 0.068 0.153 0.054 0.118 0.049 Table 1: RMSE of Monte-Carlo and TD-network predictions of various lengths and for increasing amounts of training data on the random-walk example with batch updating. 4 Experiment 2: Action-conditional Prediction The advantage of TD methods should be greater for predictions that apply only when the experience sequence unfolds in a particular way, such as when a particular sequence of actions are made. In a second experiment we sought to learn n-step-ahead predictions conditional on action selections. The question network for learning all 2-step-ahead pre- dictions is shown in Figure 1b. The upper two nodes predict the observation bit conditional on taking a left action (L) or a right action (R). The lower four nodes correspond to the two-step predictions, e.g., the second lower node is the prediction of what the observation bit will be if an L action is taken followed by an R action. These predictions are the same as the e-tests used in some of the work on predictive state representations (Littman, Sutton & Singh, 2002; Rudary & Singh, 2003). In this experiment we used a question network like that in Figure 1b except of depth four, consisting of 30 (2+4+8+16) nodes. The conditions for each node were set to 0 or 1 depending on whether the action taken on the step matched that indicated in the figure. The feature vectors were as in the previous experiment. Now that we are conditioning on action, the problem is deterministic and α can be set uniformly to 1. A Monte Carlo prediction can be learned only when its corresponding action sequence occurs in its entirety, but then it is complete and accurate in one step. The TD network, on the other hand, can learn from incomplete sequences but must propagate them back one level at a time. First the one-step predictions must be learned, then the two-step predictions from them, and so on. The results for online and batch training are shown in Tables 2 and 3. 
As anticipated, the TD network learns much faster than Monte Carlo with both online and batch updating. Because the TD network learns its n step predictions based on its n − 1 step predictions, it has a clear advantage for this task. Once the TD Network has seen each action in each state, it can quickly learn any prediction 2, 10, or 1000 steps in the future. Monte Carlo, on the other hand, must sample actual sequences, so each exact action sequence must be observed. Time Step 100 200 300 400 500 1-Step MC/TD 0.153 0.019 0.000 0.000 0.000 2-Step MC TD 0.222 0.182 0.092 0.044 0.040 0.000 0.019 0.000 0.019 0.000 3-Step MC TD 0.253 0.195 0.142 0.054 0.089 0.013 0.055 0.000 0.038 0.000 4-Step MC TD 0.285 0.185 0.196 0.062 0.139 0.017 0.093 0.000 0.062 0.000 Table 2: RMSE of the action-conditional predictions of various lengths for Monte-Carlo and TD-network methods on the random-walk problem with online updating. Time Steps 50 100 150 200 MC 53.48% 30.81% 19.26% 11.69% TD 17.21% 4.50% 1.57% 0.14% Table 3: Average proportion of incorrect action-conditional predictions for batch-updating versions of Monte-Carlo and TD-network methods, for various amounts of data, on the random-walk task. All differences are statistically significant. 5 Experiment 3: Learning a Predictive State Representation Experiments 1 and 2 showed advantages for TD learning methods in Markov problems. The feature vectors in both experiments provided complete information about the nominal state of the random walk. In Experiment 3, on the other hand, we applied TD networks to a non-Markov version of the random-walk example, in particular, in which only the special observation bit was visible and not the state number. In this case it is not possible to make accurate predictions based solely on the current action and observation; the previous time step’s predictions must be used as well. As in the previous experiment, we sought to learn n-step predictions using actionconditional question networks of depths 2, 3, and 4. The feature vector xt consisted of three parts: a constant 1, four binary features to represent the pair of action a t−1 and observation bit ot , and n more features corresponding to the components of y t−1 . The features vectors were thus of length m = 11, 19, and 35 for the three depths. In this experiment, σ(·) was the S-shaped logistic function. The initial weights W0 and predictions y0 were both 0. Fifty random-walk sequences were constructed, each of 250,000 time steps, and presented to TD networks of the three depths, with a range of step-size parameters α. We measured the RMSE of all predictions made by the networks (computed from knowledge of the task) and also the “empirical RMSE,” the error in the one-step prediction for the action actually taken on each step. We found that in all cases the errors approached zero over time, showing that the problem was completely solved. Figure 2 shows some representative learning curves for the depth-2 and depth-4 TD networks. .3 Empirical RMS error .2 α=.1 .1 α=.5 α=.5 α=.75 0 0 α=.25 depth 2 50K 100K 150K 200K 250K Time Steps Figure 2: Prediction performance on the non-Markov random walk with depth-4 TD networks (and one depth-2 network) with various step-size parameters, averaged over 50 runs and 1000 time-step bins. The “bump” most clearly seen with small step sizes is reliably present and may be due to predictions of different lengths being learned at different times. 
In ongoing experiments on other non-Markov problems we have found that TD networks do not always find such complete solutions. Other problems seem to require more than one step of history information (the one-step-preceding action and observation), though less than would be required using history information alone. Our results as a whole suggest that TD networks may provide an effective alternative learning algorithm for predictive state representations (Littman et al., 2000). Previous algorithms have been found to be effective on some tasks but not on others (e.g, Singh et al., 2003; Rudary & Singh, 2004; James & Singh, 2004). More work is needed to assess the range of effectiveness and learning rate of TD methods vis-a-vis previous methods, and to explore their combination with history information. 6 Conclusion TD networks suggest a large set of possibilities for learning to predict, and in this paper we have begun exploring the first few. Our results show that even in a fully observable setting there may be significant advantages to TD methods when learning TD-defined predictions. Our action-conditional results show that TD methods can learn dramatically faster than other methods. TD networks allow the expression of many new kinds of predictions whose extensive semantics is not immediately clear, but which are ultimately fully grounded in data. It may be fruitful to further explore the expressive potential of TD-defined predictions. Although most of our experiments have concerned the representational expressiveness and efficiency of TD-defined predictions, it is also natural to consider using them as state, as in predictive state representations. Our experiments suggest that this is a promising direction and that TD learning algorithms may have advantages over previous learning methods. Finally, we note that adding nodes to a question network produces new predictions and thus may be a way to address the discovery problem for predictive representations. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Satinder Singh, Doina Precup, Michael Littman, Mark Ring, Vadim Bulitko, Eddie Rafols, Anna Koop, Tao Wang, and all the members of the rlai.net group. References Boyan, J. A. (2000). Technical update: Least-squares temporal difference learning. Machine Learning 49:233–246. Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1/2/3):33–57. Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation 5(4):613–624. James, M. and Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 417–424. Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pp. 167–173. Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research 4(Dec):1107–1149. Littman, M. L., Sutton, R. S. and Singh, S. (2002). Predictive representations of state. In Advances In Neural Information Processing Systems 14:1555–1561. Rudary, M. R. and Singh, S. (2004). A nonlinear predictive state representation. In Advances in Neural Information Processing Systems 16:855–862. Singh, S., Littman, M. L., Jong, N. K., Pardoe, D. and Stone, P. 
(2003) Learning predictive state representations. In Proceedings of the Twentieth Int. Conference on Machine Learning, pp. 712–719. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9–44. Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 531–539. Morgan Kaufmann, San Francisco. Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181–121.

2 0.70764035 33 nips-2004-Brain Inspired Reinforcement Learning

Author: Françcois Rivest, Yoshua Bengio, John Kalaska

Abstract: Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. In contrast, the human brain readily and autonomously finds the complex features when provided with sufficient training. Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. The algorithms are compared and evaluated on the Acrobot task. 1

3 0.65701324 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

Author: Michael P. Holmes, Charles Jr.

Abstract: Schema learning is a way to discover probabilistic, constructivist, predictive action models (schemas) from experience. It includes methods for finding and using hidden state to make predictions more accurate. We extend the original schema mechanism [1] to handle arbitrary discrete-valued sensors, improve the original learning criteria to handle POMDP domains, and better maintain hidden state by using schema predictions. These extensions show large improvement over the original schema mechanism in several rewardless POMDPs, and achieve very low prediction error in a difficult speech modeling task. Further, we compare extended schema learning to the recently introduced predictive state representations [2], and find their predictions of next-step action effects to be approximately equal in accuracy. This work lays the foundation for a schema-based system of integrated learning and planning. 1

4 0.51647794 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

Author: Shyam Visweswaran, Gregory F. Cooper

Abstract: Classification algorithms typically induce population-wide models that are trained to perform well on average on expected future instances. We introduce a Bayesian framework for learning instance-specific models from data that are optimized to predict well for a particular instance. Based on this framework, we present a lazy instance-specific algorithm called ISA that performs selective model averaging over a restricted class of Bayesian networks. On experimental evaluation, this algorithm shows superior performance over model selection. We intend to apply such instance-specific algorithms to improve the performance of patient-specific predictive models induced from medical data. 1 In t ro d u c t i o n Commonly used classification algorithms, such as neural networks, decision trees, Bayesian networks and support vector machines, typically induce a single model from a training set of instances, with the intent of applying it to all future instances. We call such a model a population-wide model because it is intended to be applied to an entire population of future instances. A population-wide model is optimized to predict well on average when applied to expected future instances. In contrast, an instance-specific model is one that is constructed specifically for a particular instance. The structure and parameters of an instance-specific model are specialized to the particular features of an instance, so that it is optimized to predict especially well for that instance. Usually, methods that induce population-wide models employ eager learning in which the model is induced from the training data before the test instance is encountered. In contrast, lazy learning defers most or all processing until a response to a test instance is required. Learners that induce instance-specific models are necessarily lazy in nature since they take advantage of the information in the test instance. An example of a lazy instance-specific method is the lazy Bayesian rule (LBR) learner, implemented by Zheng and Webb [1], which induces rules in a lazy fashion from examples in the neighborhood of the test instance. A rule generated by LBR consists of a conjunction of the attribute-value pairs present in the test instance as the antecedent and a local simple (naïve) Bayes classifier as the consequent. The structure of the local simple Bayes classifier consists of the attribute of interest as the parent of all other attributes that do not appear in the antecedent, and the parameters of the classifier are estimated from the subset of training instances that satisfy the antecedent. A greedy step-forward search selects the optimal LBR rule for a test instance to be classified. When evaluated on 29 UCI datasets, LBR had the lowest average error rate when compared to several eager learning methods [1]. Typically, both eager and lazy algorithms select a single model from some model space, ignoring the uncertainty in model selection. Bayesian model averaging is a coherent approach to dealing with the uncertainty in model selection, and it has been shown to improve the predictive performance of classifiers [2]. However, since the number of models in practically useful model spaces is enormous, exact model averaging over the entire model space is usually not feasible. In this paper, we describe a lazy instance-specific averaging (ISA) algorithm for classification that approximates Bayesian model averaging in an instance-sensitive manner. 
ISA extends LBR by adding Bayesian model averaging to an instance-specific model selection algorithm. While the ISA algorithm is currently able to directly handle only discrete variables and is computationally more intensive than comparable eager algorithms, the results in this paper show that it performs well. In medicine, such lazy instance-specific algorithms can be applied to patient-specific modeling for improving the accuracy of diagnosis, prognosis and risk assessment. The rest of this paper is structured as follows. Section 2 introduces a Bayesian framework for instance-specific learning. Section 3 describes the implementation of ISA. In Section 4, we evaluate ISA and compare its performance to that of LBR. Finally, in Section 5 we discuss the results of the comparison. 2 Deci si on Th eo ret i c F rame wo rk We use the following notation. Capital letters like X, Z, denote random variables and corresponding lower case letters, x, z, denote specific values assigned to them. Thus, X = x denotes that variable X is assigned the value x. Bold upper case letters, such as X, Z, represent sets of variables or random vectors and their realization is denoted by the corresponding bold lower case letters, x, z. Hence, X = x denotes that the variables in X have the states given by x. In addition, Z denotes the target variable being predicted, X denotes the set of attribute variables, M denotes a model, D denotes the training dataset, and denotes a generic test instance that is not in D. We now characterize population-wide and instance-specific model selection in decision theoretic terms. Given training data D and a separate generic test instance , the Bayes optimal prediction for Zt is obtained by combining the predictions of all models weighted by their posterior probabilities, as follows: P (Z t | X t , D ) = ∫ P( Z t | X t , M ) P ( M | D )dM . (1) M The optimal population-wide model for predicting Zt is as follows:   max∑ U P( Z t | X t , D), P (Z t | X t , M ) P ( X | D) , M  Xt  [ ] (2) where the function U gives the utility of approximating the Bayes optimal estimate P(Zt | Xt , D), with the estimate P(Zt | Xt , M) obtained from model M. The term P(X | D) is given by: P ( X | D) = ∫ P ( X | M ) P ( M | D)dM . (3) M The optimal instance-specific model for predicting Zt is as follows: { [ ]} max U P ( Z t | X t = x t , D), P (Z t | X t = x t , M ) , M (4) where xt are the values of the attributes of the test instance Xt for which we want to predict Zt. The Bayes optimal estimate P(Zt | Xt = xt, D), in Equation 4 is derived using Equation 1, for the special case in which Xt = xt . The difference between the population-wide and the instance-specific models can be noted by comparing Equations 2 and 4. Equation 2 for the population-wide model selects the model that on average will have the greatest utility. Equation 4 for the instance-specific model, however, selects the model that will have the greatest expected utility for the specific instance Xt = xt . For predicting Zt in a given instance Xt = xt, the model selected using Equation 2 can never have an expected utility greater than the model selected using Equation 4. This observation provides support for developing instance-specific models. Equations 2 and 4 represent theoretical ideals for population-wide and instancespecific model selection, respectively; we are not suggesting they are practical to compute. The current paper focuses on model averaging, rather than model selection. 
Ideal Bayesian model averaging is given by Equation 1. Model averaging has previously been applied using population-wide models. Studies have shown that approximate Bayesian model averaging using population-wide models can improve predictive performance over population-wide model selection [2]. The current paper concentrates on investigating the predictive performance of approximate Bayesian model averaging using instance-specific models. 3 In st an ce- S p eci fi c Algo ri t h m We present the implementation of the lazy instance-specific algorithm based on the above framework. ISA searches the space of a restricted class of Bayesian networks to select a subset of the models over which to derive a weighted (averaged) posterior of the target variable Zt . A key characteristic of the search is the use of a heuristic to select models that will have a significant influence on the weighted posterior. We introduce Bayesian networks briefly and then describe ISA in detail. 3.1 B ay e s i a n N e t w or k s A Bayesian network is a probabilistic model that combines a graphical representation (the Bayesian network structure) with quantitative information (the parameters of the Bayesian network) to represent the joint probability distribution over a set of random variables [3]. Specifically, a Bayesian network M representing the set of variables X consists of a pair (G, ΘG ). G is a directed acyclic graph that contains a node for every variable in X and an arc between every pair of nodes if the corresponding variables are directly probabilistically dependent. Conversely, the absence of an arc between a pair of nodes denotes probabilistic independence between the corresponding variables. ΘG represents the parameterization of the model. In a Bayesian network M, the immediate predecessors of a node X i in X are called the parents of X i and the successors, both immediate and remote, of Xi in X are called the descendants of X i . The immediate successors of X i are called the children of X i . For each node Xi there is a local probability distribution (that may be discrete or continuous) on that node given the state of its parents. The complete joint probability distribution over X, represented by the parameterization ΘG, can be factored into a product of local probability distributions defined on each node in the network. This factorization is determined by the independences captured by the structure of the Bayesian network and is formalized in the Bayesian network Markov condition: A node (representing a variable) is independent of its nondescendants given just its parents. According to this Markov condition, the joint probability distribution on model variables X = (X1 , X 2, …, X n ) can be factored as follows: n P ( X 1 , X 2 , ..., X n ) = ∏ P ( X i | parents( X i )) , (5) i =1 where parents(Xi ) denotes the set of nodes that are the parents of X i . If Xi has no parents, then the set parents(Xi ) is empty and P(Xi | parents(X i)) is just P(Xi ). 3.2 I S A M od e l s The LBR models of Zheng and Webb [1] can be represented as members of a restricted class of Bayesian networks (see Figure 1). We use the same class of Bayesian networks for the ISA models, to facilitate comparison between the two algorithms. In Figure 1, all nodes represent attributes that are discrete. Each node in X has either an outgoing arc into target node, Z, or receives an arc from Z. That is, each node is either a parent or a child of Z. 
Thus, X is partitioned into two sets: the first containing nodes (X 1 , …, X j in Figure 1) each of which is a parent of Z and every node in the second set, and the second containing nodes (X j+1 , …, X k in Figure 1) that have as parents the node Z and every node in the first set. The nodes in the first set are instantiated to the corresponding values in the test instance for which Zt is to be predicted. Thus, the first set of nodes represents the antecedent of the LBR rule and the second set of nodes represents the consequent. ... X1= x1 Xi = xi Z Xi+1 ... Xk Figure 1: An example of a Bayesian network LBR model with target node Z and k attribute nodes of which X1 , …, X j are instantiated to values x 1 , …, x j in xt . X 1, …, X j are present in the antecedent of the LBR rule and Z, X j+1 , …, X k (that form the local simple Bayes classifier) are present in the consequent. The indices need not be ordered as shown, but are presented in this example for convenience of exposition. 3.3 M od e l A ve r ag i n g For Bayesian networks, Equation 1 can be evaluated as follows: P ( Z t | x t , D ) = ∑ P ( Z t | x t , M ) P( M | D ) , (6) M with M being a Bayesian network comprised of structure G and parameters ΘG. The probability distribution of interest is a weighted average of the posterior distribution over all possible Bayesian networks where the weight is the probability of the Bayesian network given the data. Since exhaustive enumeration of all possible models is not feasible, even for this class of simple Bayesian networks, we approximate exact model averaging with selective model averaging. Let R be the set of models selected by the search procedure from all possible models in the model space, as described in the next section. Then, with selective model averaging, P(Zt | xt, D) is estimated as: ∑RP( Z t | x t , M ) P(M | D) P (Z t | x t , D) ≅ M ∈ . ∑RP(M | D) M∈ (7) Assuming uniform prior belief over all possible models, the model posterior P(M | D) in Equation 7 can be replaced by the marginal likelihood P(D | M), to obtain the following equation: P ( Z | x , D) ≅ t t ∑ P ( Z t | x t , M ) P( D | M ) . ∑RP( D | M ) M∈ M ∈R (8) The (unconditional) marginal likelihood P(D | M) in Equation 8, is a measure of the goodness of fit of the model to the data and is also known as the model score. While this score is suitable for assessing the model’s fit to the joint probability distribution, it is not necessarily appropriate for assessing the goodness of fit to a conditional probability distribution which is the focus in prediction and classification tasks, as is the case here. A more suitable score in this situation is a conditional model score that is computed from training data D of d instances as: d score( D, M ) = ∏ P ( z p | x1 ,..., x p ,z 1 ,...,z p −1 ,M ) . (9) p =1 This score is computed in a predictive and sequential fashion: for the pth training instance the probability of predicting the observed value zp for the target variable is computed based on the values of all the variables in the preceding p-1 training instances and the values xp of the attributes in the pth instance. One limitation of this score is that its value depends on the ordering of the data. Despite this limitation, it has been shown to be an effective scoring criterion for classification models [4]. 
The parameters of the Bayesian network M used in the above computations are defined as follows:

P(X_i = k \mid \mathrm{parents}(X_i) = j) \equiv \theta_{ijk} = \frac{N_{ijk} + \alpha_{ijk}}{N_{ij} + \alpha_{ij}},   (10)

where (i) N_{ijk} is the number of instances in the training dataset D in which variable Xi has value k and the parents of Xi are in state j, (ii) N_{ij} = \sum_k N_{ijk}, (iii) α_{ijk} is a parameter prior that can be interpreted as the belief equivalent of having previously observed α_{ijk} instances in which variable Xi has value k and the parents of Xi are in state j, and (iv) α_{ij} = \sum_k α_{ijk}.

3.4 Model Search

We use a two-phase best-first heuristic search to sample the model space. The first phase ignores the evidence xt in the test instance while searching for models that have high scores as given by Equation 9. The second phase then searches for models having the greatest impact on the prediction of Zt for the test instance, which we formalize below.

The first phase searches for models that predict Z well in the training data; these are the models with high conditional model scores. The initial model is the simple Bayes network that includes all the attributes in X as children of Z. A succeeding model is derived from a current model by reversing the arc of a child node in the current model, adding new outgoing arcs from that node to Z and to the remaining children, and instantiating the node to its value in the test instance. This process is performed for each child in the current model. An incoming arc of a child node is considered for reversal only if the node's value is not missing in the test instance. The newly derived models are added to a priority queue Q. During each iteration of the search, the model with the highest score (given by Equation 9) is removed from Q and placed in a set R, after which new models are generated from it as described above, scored, and added to Q. The first phase terminates after a user-specified number of models have accumulated in R.

The second phase searches for models that change the current model-averaged estimate of P(Zt | xt, D) the most. The idea is to find viable competing models for making this posterior probability prediction; when no competitive models can be found, the prediction becomes stable. During each iteration of the search, the highest-ranked model M* is removed from Q and added to R. The ranking is based on how much the model changes the current estimate of P(Zt | xt, D): more change is better. In particular, M* is the model in Q that maximizes the following function:

f(R, M^*) = g(R) - g(R \cup \{M^*\}),   (11)

where, for a set of models S, the function g(S) computes the approximate model-averaged prediction for Zt as follows:

g(S) = \frac{\sum_{M \in S} P(Z_t \mid x_t, M)\, \mathrm{score}(D, M)}{\sum_{M \in S} \mathrm{score}(D, M)}.   (12)

The second phase terminates when no new model can be found whose value (as given by Equation 11) is greater than a user-specified minimum threshold T. The final distribution of Zt is then computed from the models in R using Equation 8.

4 Evaluation

We evaluated ISA on the 29 UCI datasets that Zheng and Webb used for the evaluation of LBR. On the same datasets, we also evaluated a simple Bayes classifier (SB) and LBR. For SB and LBR, we used the Weka implementations (Weka v3.3.6, http://www.cs.waikato.ac.nz/ml/weka/) with default settings [5]. We implemented the ISA algorithm as a standalone application in Java.
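A compact sketch of the phase-2 selection loop follows, with g(S) as in Equation 12 and the change f(R, M*) as in Equation 11. Treating the prediction as a single probability, and the particular data structures and example numbers, are simplifying assumptions for illustration rather than the paper's Java implementation.

```python
# Hedged sketch of the phase-2 search (Equations 11 and 12). For simplicity,
# P(Z_t | x_t, M) is treated as a single probability (say, of Z_t = 1), so g(S)
# is a scalar. "models" maps a model id to a (prediction, score) pair, where
# score is the conditional model score of Equation 9. These structures are
# illustrative assumptions, not the authors' implementation.

def g(selected, models):
    """Score-weighted, model-averaged prediction over the set of models S (Eq. 12)."""
    num = sum(models[m][0] * models[m][1] for m in selected)
    den = sum(models[m][1] for m in selected)
    return num / den if den > 0 else 0.0

def phase_two(candidates, R, models, T):
    """Add candidates to R while some model still changes g(R) by more than T."""
    candidates, R = set(candidates), set(R)
    while candidates:
        # f(R, M*) = g(R) - g(R U {M*}) for each remaining candidate (Eq. 11).
        f = {m: g(R, models) - g(R | {m}, models) for m in candidates}
        best = max(f, key=f.get)
        if f[best] <= T:
            break              # no competitive model remains; the prediction is stable
        candidates.remove(best)
        R.add(best)
    return R

# Example with three hypothetical models: id -> (P(Z_t = 1 | x_t, M), score(D, M))
models = {"M1": (0.80, 0.9), "M2": (0.30, 0.5), "M3": (0.78, 0.8)}
print(phase_two({"M2", "M3"}, {"M1"}, models, T=0.001))  # adds M2, then stops
```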
The following settings were used for ISA: a maximum of 100 phase-1 models, a threshold T of 0.001 in phase 2, and an upper limit of 500 models in R. For the parameter priors in Equation 10, all αijk were set to 1. All error rates were obtained by averaging the results of two runs of stratified 10-fold cross-validation (20 trials in total), similar to the procedure used by Zheng and Webb. Since both LBR and ISA can handle only discrete attributes, all numeric attributes were discretized in a pre-processing step using the entropy-based discretization method described in [6]. For each pair of training and test folds, the discretization intervals were first estimated from the training fold and then applied to both folds. The error rates of two algorithms on a dataset were compared with a paired t-test carried out at the 5% significance level on the error-rate statistics obtained from the 20 trials.

The results are shown in Table 1. Compared to SB, ISA has significantly fewer errors on 9 datasets and significantly more errors on one dataset. Compared to LBR, ISA has significantly fewer errors on 7 datasets and significantly more errors on two datasets. On two datasets, chess and tic-tac-toe, ISA shows considerable improvement in performance over both SB and LBR. With respect to computation times, ISA took 6 times longer to run than LBR on average for a single test instance on a desktop computer with a 2 GHz Pentium 4 processor and 3 GB of RAM.

Table 1: Percent error rates of simple Bayes (SB), Lazy Bayesian Rule (LBR) and Instance-Specific Averaging (ISA). A "-" indicates that the ISA error rate is statistically significantly lower than the marked SB or LBR error rate; a "+" indicates that the ISA error rate is statistically significantly higher.

Dataset (Size, No. of classes, Num. Attrib., Nom. Attrib.): Annealing (898, 6, 6, 32); Audiology (226, 24, 0, 69); Breast (W) (699, 2, 9, 0); Chess (KR-KP) (3169, 2, 0, 36); Credit (A) (690, 2, 6, 9); Echocardiogram (131, 2, 6, 1); Glass (214, 6, 9, 0); Heart (C) (303, 2, 13, 0); Hepatitis (155, 2, 6, 13); Horse colic (368, 2, 7, 15); House votes 84 (435, 2, 0, 16); Hypothyroid (3163, 2, 7, 18); Iris (150, 3, 4, 0); Labor (57, 2, 8, 8); LED 24 (200, 10, 0, 24); Liver disorders (345, 2, 6, 0); Lung cancer (32, 3, 0, 56); Lymphography (148, 4, 0, 18); Pima (768, 2, 8, 0); Postoperative (90, 3, 1, 7); Primary tumor (339, 22, 0, 17); Promoters (106, 2, 0, 57); Solar flare (1389, 2, 0, 10); Sonar (208, 2, 60, 0); Soybean (683, 19, 0, 35); Splice junction (3177, 3, 0, 60); Tic-Tac-Toe (958, 2, 0, 9); Wine (178, 3, 13, 0); Zoo (101, 7, 0, 16).

Percent error rates (SB, LBR, ISA): 1.9 3.5 2.7 29.6 29.4 30.9 3.7 2.9 + 2.8 + 1.1 12.1 3.0 13.8 14.0 13.9 33.2 34.0 35.9 26.9 27.8 29.0 16.2 16.2 17.5 14.2 - 14.2 - 11.3 20.2 16.0 17.8 5.1 10.1 7.0 0.9 0.9 1.4 6.0 6.0 5.3 8.8 6.1 7.0 40.5 40.5 40.3 36.8 36.8 36.8 56.3 56.3 56.3 15.5 - 15.5 - 13.2 21.8 22.0 22.3 33.3 33.3 33.3 54.4 53.5 54.2 7.5 7.5 7.5 20.2 18.3 + 19.4 15.4 15.6 15.9 7.1 7.2 7.9 4.7 4.3 4.4 30.3 - 13.7 - 10.3 1.1 1.1 1.1 6.4 8.4 8.4 -

5 Conclusions and Future Research

We have introduced a Bayesian framework for instance-specific model averaging and presented ISA as one example of a classification algorithm based on this framework. An instance-specific algorithm like LBR that performs model selection has been shown by Zheng and Webb to classify better than several eager algorithms [1]. Our results show that ISA, which extends LBR by adding Bayesian model averaging, improves overall on LBR; this supports the view that instance-specific model averaging can yield additional predictive improvement over instance-specific model selection alone.
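The per-dataset significance test described above can be reproduced with a paired t-test over the 20 matched per-trial error rates, for example as in the short sketch below; the two error-rate lists are made-up placeholders, not values from Table 1.

```python
# Hedged sketch of the per-dataset comparison: a paired t-test at the 5% level
# over matched per-trial error rates. The numbers below are placeholders.
from scipy import stats

lbr_errors = [14.1, 14.5, 13.8, 14.9, 14.2, 14.0, 14.6, 14.3, 13.9, 14.4,
              14.2, 14.7, 14.0, 14.5, 14.1, 14.3, 13.8, 14.6, 14.2, 14.4]  # 20 trials
isa_errors = [11.0, 11.6, 11.2, 11.9, 11.3, 11.1, 11.7, 11.4, 11.0, 11.5,
              11.2, 11.8, 11.1, 11.6, 11.3, 11.4, 10.9, 11.7, 11.2, 11.5]

t_stat, p_value = stats.ttest_rel(lbr_errors, isa_errors)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant at 5%: {p_value < 0.05}")
```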
In future work, we plan to explore further the behavior of ISA with respect to the number of models being averaged and the effect of the number of models selected in each of the two phases of the search. We will also investigate methods to improve the computational efficiency of ISA. In addition, we plan to examine other heuristics for model search as well as more general model spaces, such as unrestricted Bayesian networks. The instance-specific framework is not restricted to the Bayesian network models that we have used in this investigation; in the future, we plan to explore other models within this framework. Our ultimate interest is to apply these instance-specific algorithms to improve patient-specific predictions (for diagnosis, therapy selection, and prognosis) and thereby to improve patient care.

Acknowledgments

This work was supported by grant T15-LM/DE07059 from the National Library of Medicine (NLM) to the University of Pittsburgh's Biomedical Informatics Training Program. We would like to thank the three anonymous reviewers for their helpful comments.

References

[1] Zheng, Z. and Webb, G.I. (2000). Lazy Learning of Bayesian Rules. Machine Learning, 41(1):53-84.
[2] Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14:382-417.
[3] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA.
[4] Kontkanen, P., Myllymaki, P., Silander, T., and Tirri, H. (1999). On Supervised Selection of Bayesian Networks. In Proceedings of the 15th International Conference on Uncertainty in Artificial Intelligence, pages 334-342, Stockholm, Sweden. Morgan Kaufmann.
[5] Witten, I.H. and Frank, E. (2000). Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco, CA.
[6] Fayyad, U.M., and Irani, K.B. (1993). Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 1022-1027, San Mateo, CA. Morgan Kaufmann.

5 0.48418456 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography

Author: Einat Klein, Rachel Mislovaty, Ido Kanter, Andreas Ruttor, Wolfgang Kinzel

Abstract: Two neural networks that are trained on their mutual output synchronize to an identical time-dependent weight vector. This novel phenomenon can be used for creation of a secure cryptographic secret-key using a public channel. Several models for this cryptographic system have been suggested, and have been tested for their security under different sophisticated attack strategies. The most promising models are networks that involve chaos synchronization. The synchronization process of mutual learning is described analytically using statistical physics methods.

6 0.46245086 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors

7 0.46006969 155 nips-2004-Responding to Modalities with Different Latencies

8 0.44808918 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM

9 0.44067329 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

10 0.37793857 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

11 0.37769833 48 nips-2004-Convergence and No-Regret in Multiagent Learning

12 0.3704896 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

13 0.34978276 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

14 0.34555089 74 nips-2004-Harmonising Chorales by Probabilistic Inference

15 0.32827935 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities

16 0.32647759 88 nips-2004-Intrinsically Motivated Reinforcement Learning

17 0.30906358 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

18 0.30740654 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters

19 0.30112055 185 nips-2004-The Convergence of Contrastive Divergences

20 0.29323971 64 nips-2004-Experts in a Markov Decision Process


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.051), (15, 0.493), (26, 0.064), (31, 0.025), (33, 0.157), (35, 0.026), (50, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99313217 50 nips-2004-Dependent Gaussian Processes

Author: Phillip Boyle, Marcus Frean

Abstract: Gaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs. 1

2 0.99309373 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

Author: Le Lu, Gregory D. Hager, Laurent Younes

Abstract: Visual action recognition is an important problem in computer vision. In this paper, we propose a new method to probabilistically model and recognize actions of articulated objects, such as hand or body gestures, in image sequences. Our method consists of three levels of representation. At the low level, we first extract a feature vector invariant to scale and in-plane rotation by using the Fourier transform of a circular spatial histogram. Then, spectral partitioning [20] is utilized to obtain an initial clustering; this clustering is then refined using a temporal smoothness constraint. Gaussian mixture model (GMM) based clustering and density estimation in the subspace of linear discriminant analysis (LDA) are then applied to thousands of image feature vectors to obtain an intermediate level representation. Finally, at the high level we build a temporal multiresolution histogram model for each action by aggregating the clustering weights of sampled images belonging to that action. We discuss how this high level representation can be extended to achieve temporal scaling invariance and to include Bi-gram or Multi-gram transition information. Both image clustering and action recognition/segmentation results are given to show the validity of our three tiered representation.

3 0.99227458 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition

Author: Tao Xiong, Jieping Ye, Qi Li, Ravi Janardan, Vladimir Cherkassky

Abstract: Linear Discriminant Analysis (LDA) is a well-known method for feature extraction and dimension reduction. It has been used widely in many applications such as face recognition. Recently, a novel LDA algorithm based on QR Decomposition, namely LDA/QR, has been proposed, which is competitive in terms of classification accuracy with other LDA algorithms, but it has much lower costs in time and space. However, LDA/QR is based on linear projection, which may not be suitable for data with nonlinear structure. This paper first proposes an algorithm called KDA/QR, which extends the LDA/QR algorithm to deal with nonlinear data by using the kernel operator. Then an efficient approximation of KDA/QR called AKDA/QR is proposed. Experiments on face image data show that the classification accuracy of both KDA/QR and AKDA/QR are competitive with Generalized Discriminant Analysis (GDA), a general kernel discriminant analysis algorithm, while AKDA/QR has much lower time and space costs. 1

4 0.98576427 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks

Author: Joris M. Mooij, Hilbert J. Kappen

Abstract: We introduce a computationally efficient method to estimate the validity of the BP method as a function of graph topology, the connectivity strength, frustration and network size. We present numerical results that demonstrate the correctness of our estimates for the uniform random model and for a real-world network (“C. Elegans”). Although the method is restricted to pair-wise interactions, no local evidence (zero “biases”) and binary variables, we believe that its predictions correctly capture the limitations of BP for inference and MAP estimation on arbitrary graphical models. Using this approach, we find that BP always performs better than MF. Especially for large networks with broad degree distributions (such as scale-free networks) BP turns out to significantly outperform MF. 1

5 0.97373414 92 nips-2004-Kernel Methods for Implicit Surface Modeling

Author: Joachim Giesen, Simon Spalinger, Bernhard Schölkopf

Abstract: We describe methods for computing an implicit model of a hypersurface that is given only by a finite sampling. The methods work by mapping the sample points into a reproducing kernel Hilbert space and then determining regions in terms of hyperplanes. 1

same-paper 6 0.97021687 183 nips-2004-Temporal-Difference Networks

7 0.96407092 197 nips-2004-Two-Dimensional Linear Discriminant Analysis

8 0.92086852 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

9 0.86751997 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

10 0.86512899 148 nips-2004-Probabilistic Computation in Spiking Populations

11 0.8558954 168 nips-2004-Semigroup Kernels on Finite Sets

12 0.85573786 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

13 0.84422201 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

14 0.82952815 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

15 0.8283456 178 nips-2004-Support Vector Classification with Input Data Uncertainty

16 0.82552135 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

17 0.8229574 94 nips-2004-Kernels for Multi--task Learning

18 0.8159579 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

19 0.8154853 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation

20 0.8144325 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning