nips nips2005 nips2005-148 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter Mccracken, Michael Bowling
Abstract: Predictive state representations (PSRs) are a method of modeling dynamical systems using only observable data, such as actions and observations, to describe their model. PSRs use predictions about the outcome of future tests to summarize the system state. The best existing techniques for discovery and learning of PSRs use a Monte Carlo approach to explicitly estimate these outcome probabilities. In this paper, we present a new algorithm for discovery and learning of PSRs that uses a gradient descent approach to compute the predictions for the current state. The algorithm takes advantage of the large amount of structure inherent in a valid prediction matrix to constrain its predictions. Furthermore, the algorithm can be used online by an agent to constantly improve its prediction quality; something that current state of the art discovery and learning algorithms are unable to do. We give empirical results to show that our constrained gradient algorithm is able to discover core tests using very small amounts of data, and with larger amounts of data can compute accurate predictions of the system dynamics. 1
Reference: text
sentIndex sentText sentNum sentScore
1 ca Abstract Predictive state representations (PSRs) are a method of modeling dynamical systems using only observable data, such as actions and observations, to describe their model. [sent-5, score-0.278]
2 PSRs use predictions about the outcome of future tests to summarize the system state. [sent-6, score-0.639]
3 In this paper, we present a new algorithm for discovery and learning of PSRs that uses a gradient descent approach to compute the predictions for the current state. [sent-8, score-0.476]
4 The algorithm takes advantage of the large amount of structure inherent in a valid prediction matrix to constrain its predictions. [sent-9, score-0.184]
5 Furthermore, the algorithm can be used online by an agent to constantly improve its prediction quality; something that current state of the art discovery and learning algorithms are unable to do. [sent-10, score-0.509]
6 We give empirical results to show that our constrained gradient algorithm is able to discover core tests using very small amounts of data, and with larger amounts of data can compute accurate predictions of the system dynamics. [sent-11, score-1.255]
7 A more recently developed group of algorithms, known as predictive representations, identify state in dynamical systems by predicting what will happen in the future. [sent-15, score-0.234]
8 Algorithms following this paradigm include observable operator models [1], predictive state representations [2, 3], TD-Nets [4] and TPSRs [5]. [sent-16, score-0.268]
9 In this research we focus on predictive state representations (PSRs). [sent-17, score-0.235]
10 They have required explicit control of the system using a reset action [6, 5], or have required the incoming data stream to be generated using an open-loop policy [7]. [sent-20, score-0.186]
11 Like the myopic gradient descent algorithm [8], the algorithm we propose uses a gradient approach to move its predictions closer to its empirical observations; however, our algorithm also takes advantage of known constraints on valid test predictions. [sent-25, score-0.639]
12 We show that this constrained gradient approach is capable of discovering a set of core tests quickly, and also of making online predictions that improve as more data is available. [sent-26, score-1.371]
13 An agent in a dynamical system experiences a sequence of action-observation pairs, or ao pairs. [sent-31, score-0.268]
14 The sequence of ao pairs the agent has already experienced, beginning at the first time step, is known as a history. [sent-32, score-0.195]
15 an on of length n means that the agent chose action a1 and perceived observation o1 at the first time step, after which the agent chose a2 and perceived o2 , and so on1 . [sent-36, score-0.292]
16 A test is a sequence of ao pairs that begins immediately after a history. [sent-37, score-0.215]
17 A test is said to succeed if the observations in the sequence are observed in order, given that the actions in the sequence are chosen in order. [sent-38, score-0.19]
18 A test fails if the action sequence is taken but the observation sequence is not observed. [sent-40, score-0.246]
19 A prediction about the outcome of a test t depends on the history h that preceded it, so we write predictions as p(t|h), to represent the probability of t succeeding after history h. [sent-41, score-0.411]
20 If T is a set of tests and H is a set of histories, p(t|h) is a single value, p(T |h) is a row vector containing p(ti |h) for all tests ti ∈ T , p(t|H) is a column vector containing p(t|hj ) for all histories hj ∈ H, and p(T |H) is a matrix containing p(ti |hj ) for all ti ∈ T and hj ∈ H. [sent-48, score-1.417]
21 In this paper, we restrict our discussion of PSRs to linear PSRs, in which the function ft is a linear function of the tests in Q. [sent-52, score-0.529]
22 The tests in Q are known as core tests, and determining which tests are core tests is known as the discovery problem. [sent-54, score-2.359]
23 A one-step extension of a test t is a test aot, that prefixes the original test with a single ao pair. [sent-56, score-0.34]
24 The state vector of a PSR at time i is the set of predictions p(Q|hi ). [sent-58, score-0.196]
25 Here we use the notation that a superscript ai or oi indicates the time step of an action or observation, and a subscript ai or oi indicates that the action or observation is a particular element of the set A or O. [sent-61, score-0.749]
26 state vector is updated by computing, for each qj ∈ Q: p(qj |hi ) = p(Q|hi−1 )mai oi qj p(ai oi qj |hi−1 ) = p(ai oi |hi−1 ) p(Q|hi−1 )mai oi Thus, in order to update the PSR at each time step, the vector mt must be known for each test t ∈ X. [sent-62, score-1.23]
27 3 Constrained Gradient Learning of PSRs The goal of this paper is to develop an online algorithm for discovering and learning a PSR without the necessity of a reset action. [sent-64, score-0.264]
28 In this section, we introduce our constrained gradient approach to solving this problem. [sent-66, score-0.242]
29 1, we will assume that the set of core tests Q is given to the algorithm; we describe how Q can be estimated online in Section 3. [sent-69, score-0.99]
30 1 Learning the PSR Parameters The approach to learning taken by the constrained gradient algorithm is to approximate the matrix p(T |H), for a selected set of tests T and histories H. [sent-72, score-0.988]
31 However, as will be explained in the next section, these tests are not sufficient to take full advantage of the structure in a prediction matrix. [sent-77, score-0.552]
32 The constrained gradient algorithm requires the tests in T to satisfy two properties: 1. [sent-78, score-0.774]
33 All histories in H are histories that have been experienced by the agent. [sent-82, score-0.296]
34 The current history, hi , must always be in H in order to make online predictions, and also to compute hi+1 . [sent-83, score-0.436]
35 The only other requirement of H is that it contain sufficient histories to compute the linear functions mt for the tests in T (see Section 3. [sent-84, score-0.716]
36 When a new data point is seen and a new row is added to the matrix, the oldest row in the matrix is “forgotten. [sent-87, score-0.216]
37 The approach used to build the matrix p(T |H) is to estimate and append a new row, p(T |hi ), after each new ai oi pair is encountered. [sent-90, score-0.325]
38 2 The creation of the new row is performed in two stages: a row estimation stage, and a gradient descent stage. [sent-93, score-0.306]
39 Both stages take advantage of four constraints on the predictions p(T |h) in order to be a valid row in the prediction matrix: 1. [sent-95, score-0.295]
40 Range: 0 ≤ p(t|h) ≤ 1 Null Test: p(ε|h) = 1 Internal Consistency: p(t|h) = oj ∈O p(taoj |h) ∀a ∈ A Conditional Probability: p(t|hao) = p(aot|h)/p(ao|h) ∀a ∈ A, o ∈ O The range constraint restricts the entries in the matrix to be valid probabilities. [sent-99, score-0.213]
41 After action ai is taken and observation oi is seen, a new row for history hi = hi−1 ai oi must be added to the matrix. [sent-104, score-1.121]
42 First, as much of the new row as possible is computed using the conditional probability constraint, and the predictions for history hi−1 . [sent-105, score-0.249]
43 For all tests t ∈ T for which ai oi t ∈ T : p(t|hi ) ← p(ai oi t|hi−1 ) p(ai oi |hi−1 ) Because X ⊂ T , it is guaranteed that p(Q|hi ) is estimated in this step. [sent-106, score-1.189]
44 The second phase of adding a new row is to compute predictions for the tests t ∈ T for which ai oi t ∈ T . [sent-107, score-0.968]
45 An estimate of p(t|hi ) can be found by computing p(Q|hi )mt for an appropriate mt , using the PSR assumption that any prediction is a linear combination of core test predictions. [sent-108, score-0.596]
46 At this stage, the entire row for hi has been estimated. [sent-110, score-0.358]
47 Then, to ensure internal consistency within the row, the normalization property is enforced by setting predictions: p(taoj |hi ) ← p(t|hi )p(taoj |hi ) i oi ∈O p(taoi |h ) ∀oj ∈ O This preserves the ratio among sibling predictions and creates a valid probability distribution from them. [sent-113, score-0.431]
48 The normalization is performed by normalizing shorter tests first, which guarantees that a set of tests are not normalized to a value that will later change. [sent-114, score-1.027]
49 The gradient descent stage of estimating a new row moves the constraint-generated predictions in the direction of the gradient created by the new observation. [sent-116, score-0.47]
50 Any prediction p(tao|hi ) whose test tao is successfully executed over the next several time steps is updated using p(tao|hi ) ← (1 − α)p(tao|hi ) + α(p(t|hi )), for some learning rate 0 ≤ α ≤ 1. [sent-117, score-0.286]
51 In this case, we maintain a buffer of ao pairs between the current time step and the histories that are added to the prediction matrix. [sent-123, score-0.399]
52 The length of the buffer is the length of the longest test in T . [sent-124, score-0.179]
53 their action sequence is executed but their observation sequence is not observed) will have their probability reduced due to this re-normalization step. [sent-128, score-0.195]
54 Once a new row for hi is estimated, the current PSR state vector is p(Q|hi ). [sent-131, score-0.479]
55 2 Discovery of Core Tests In the previous section, we assumed that the set of core tests was given to the algorithm. [sent-135, score-0.872]
56 A rudimentary, but effective, method of finding core tests is to choose tests whose corresponding columns of the matrix p(T |H) are most linearly unrelated to the set of core tests already selected. [sent-137, score-2.307]
57 The condition number of the matrix p({Q, t}|H) is an indication of the linear relatedness of test t; if it is well-conditioned, the test is likely to be linearly independent. [sent-139, score-0.249]
58 To choose core tests, we find the test t in X whose matrix p({Q, t}|H) is most well-conditioned. [sent-140, score-0.492]
59 If the condition number of that test is below a threshold parameter, it is chosen as a new core test. [sent-141, score-0.455]
60 Because candidate tests are selected from X, the discovered set Q will be a regular form PSR [10]. [sent-143, score-0.521]
61 The above core test selection procedure runs after every N data points are seen, where N is the maximum number of histories kept in H. [sent-145, score-0.584]
62 After each new core test is selected, T is augmented with the one-step extensions of the new test, as well as any other tests needed to satisfy the rules in Section 3. [sent-146, score-0.954]
63 4 Experiments and Results The goal of the constrained gradient algorithm is to choose a correct set of core tests and to make accurate, online predictions. [sent-148, score-1.286]
64 The error for each history 1 hi was computed using the error measure |O| oj ∈O (p(ai+1 oj |hi ) − p(ai+1 oj |hi ))2 [7]. [sent-155, score-0.744]
65 ˆ This measures the mean error in the one-step tests involving the action that was actually taken at step i + 1. [sent-156, score-0.609]
66 The size bound on H was set to 1000, and the condition threshold for adding new tests was 10. [sent-158, score-0.499]
67 The core test discovery procedure was run every 1000 data points. [sent-160, score-0.571]
68 1 Discovery Results In this section, we examine the success of the constrained gradient algorithm at discovering core tests. [sent-162, score-0.679]
69 Table 1 shows, for each test domain, the true number of core tests for the Table 1: The number of core tests found by the constrained gradient algorithm. [sent-163, score-2.068]
70 5 2000 Suffix-History |Q|/Correct # Data 2 4000 2 4000 7 1024000 9 1024000 9 32000 5 1024000 3 2048000 dynamical system (|Q|), the number of core tests selected by the constrained gradient algorithm (|Q|), and how many of the selected core tests were actually core tests (Correct). [sent-182, score-3.008]
71 Table 1 also shows the time step at which the last core test was chosen (# Data). [sent-184, score-0.483]
72 In all domains, the algorithm found a majority of the core tests after only several thousand data points; in several cases, the core tests were found after only a single run of the core test selection procedure. [sent-185, score-2.232]
73 All of the core tests found by the suffix-history algorithm were true core tests. [sent-187, score-1.278]
74 In all cases except the 4x3 Maze, the constrained gradient algorithm was able to find at least as many core tests as the suffix-history method, and required significantly less data. [sent-188, score-1.147]
75 To be fair, the suffix-history algorithm uses a conservative approach of selecting core tests, and therefore requires more data. [sent-189, score-0.406]
76 The constrained gradient algorithm chooses tests that give an early indication of being linearly independent. [sent-190, score-0.822]
77 Therefore, the constrained gradient finds most, or all, core tests extremely quickly, but can also choose tests that are not linearly independent. [sent-191, score-1.64]
78 2 Online and Offline Results Figure 1 shows the performance of the constrained gradient approach, in online and offline settings. [sent-193, score-0.36]
79 The question answered by the online experiments is: How accurately can the constrained gradient algorithm predict the outcome of the next time step? [sent-194, score-0.425]
80 At each time i, we measured the error in the algorithm’s predictions of p(ai+1 oj |hi ) for each oj ∈ O. [sent-195, score-0.361]
81 The question posed for the offline experiments was: What is the long-term performance of the PSRs learned by the constrained gradient algorithm? [sent-197, score-0.242]
82 To test this, we stopped the learning process at different points in the training sequence and computed the current PSR. [sent-198, score-0.18]
83 The initial state vector for the offline tests was set to the column means of p(Q|H), which approximates the state vector of the system’s stationary distribution. [sent-199, score-0.675]
84 In Figure 1, the ‘Offline’ plot shows the mean error of this PSR on a test sequence of length 10,000. [sent-200, score-0.178]
85 The y-axis shows the mean error on the one-step predictions (Online) or on a test sequence (Offline and Suffix-History). [sent-208, score-0.252]
86 the constrained gradient approach is competitive with current PSR learning algorithms. [sent-211, score-0.328]
87 The performance plateau in the 4x3 Maze and Network domains is unsurprising, because in these domains only a subset of the correct core tests were found (see Table 1). [sent-212, score-0.975]
88 The plateau in the Bridge domain is more concerning, because in this domains all of the correct core tests were found. [sent-213, score-0.946]
89 We suspect this may be due to a local minimum in the error space; more tests need to be performed to investigate this phenomenon. [sent-214, score-0.522]
90 5 Future Work and Conclusion We have demonstrated that the constrained gradient algorithm can do online learning and discovery of predictive state representations from an arbitrary stream of experience. [sent-215, score-0.807]
91 In the current method of core test selection, the condition of the core test matrix p(Q|H) is important. [sent-218, score-0.98]
92 If the matrix becomes ill-conditioned, it prevents new core tests from becoming selected. [sent-219, score-0.909]
93 This can happen if the true core test matrix p(Q|H) is poorly conditioned (because some core tests are similar), or if incorrect core tests are added to Q. [sent-220, score-2.269]
94 To prevent this problem, there needs to be a mechanism for removing chosen core tests if they turn out to be linearly dependent. [sent-221, score-0.899]
95 Also, the condition threshold should be gradually increased during learning, to allow more obscure core tests to be selected. [sent-222, score-0.872]
96 To date, the constrained gradient algorithm is the only PSR algorithm that takes advantage of the sequential nature of the data stream experienced by the agent, and the constraints such a sequence imposes on the system. [sent-226, score-0.422]
97 Empirical results show that, while there is room for improvement, the constrained gradient algorithm is competitive in both discovery and learning of PSRs. [sent-229, score-0.444]
98 Learning and discovery of predictive state representations in dynamical systems with reset. [sent-249, score-0.424]
99 Learning predictive state representations in dynamical systems without reset. [sent-253, score-0.308]
100 An online algorithm for discovery and learning of prediction state representations. [sent-259, score-0.414]
wordName wordTfidf (topN-words)
[('tests', 0.499), ('core', 0.373), ('psr', 0.344), ('hi', 0.285), ('psrs', 0.204), ('oi', 0.201), ('gradient', 0.129), ('histories', 0.129), ('online', 0.118), ('discovery', 0.116), ('oj', 0.115), ('constrained', 0.113), ('predictions', 0.108), ('ao', 0.094), ('tao', 0.094), ('predictive', 0.093), ('mx', 0.09), ('ine', 0.089), ('mt', 0.088), ('ai', 0.087), ('maze', 0.086), ('test', 0.082), ('representations', 0.074), ('row', 0.073), ('dynamical', 0.073), ('taoj', 0.072), ('satinder', 0.072), ('state', 0.068), ('history', 0.068), ('agent', 0.062), ('valid', 0.061), ('alberta', 0.06), ('action', 0.059), ('reset', 0.056), ('prediction', 0.053), ('suf', 0.045), ('michael', 0.045), ('qj', 0.041), ('pomdps', 0.04), ('bridge', 0.04), ('sequence', 0.039), ('hj', 0.039), ('experienced', 0.038), ('stream', 0.037), ('matrix', 0.037), ('aot', 0.036), ('bowling', 0.036), ('float', 0.036), ('mai', 0.036), ('shuttle', 0.036), ('taoi', 0.036), ('littman', 0.036), ('james', 0.034), ('policy', 0.034), ('icml', 0.034), ('length', 0.034), ('algorithm', 0.033), ('current', 0.033), ('added', 0.033), ('observable', 0.033), ('consistency', 0.032), ('outcome', 0.032), ('cheese', 0.031), ('paint', 0.031), ('tiger', 0.031), ('descent', 0.031), ('ti', 0.031), ('discovering', 0.031), ('executed', 0.031), ('ft', 0.03), ('actions', 0.03), ('domains', 0.029), ('null', 0.029), ('buffer', 0.029), ('matthew', 0.029), ('normalization', 0.029), ('step', 0.028), ('linearly', 0.027), ('observation', 0.027), ('pomdp', 0.027), ('competitive', 0.027), ('peter', 0.026), ('learning', 0.026), ('edmonton', 0.025), ('rows', 0.025), ('update', 0.025), ('singh', 0.024), ('perceived', 0.024), ('expanded', 0.024), ('twentieth', 0.024), ('wolfe', 0.024), ('plateau', 0.024), ('error', 0.023), ('richard', 0.022), ('batch', 0.022), ('selected', 0.022), ('correct', 0.021), ('indication', 0.021), ('sutton', 0.021), ('vector', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999893 148 nips-2005-Online Discovery and Learning of Predictive State Representations
Author: Peter Mccracken, Michael Bowling
Abstract: Predictive state representations (PSRs) are a method of modeling dynamical systems using only observable data, such as actions and observations, to describe their model. PSRs use predictions about the outcome of future tests to summarize the system state. The best existing techniques for discovery and learning of PSRs use a Monte Carlo approach to explicitly estimate these outcome probabilities. In this paper, we present a new algorithm for discovery and learning of PSRs that uses a gradient descent approach to compute the predictions for the current state. The algorithm takes advantage of the large amount of structure inherent in a valid prediction matrix to constrain its predictions. Furthermore, the algorithm can be used online by an agent to constantly improve its prediction quality; something that current state of the art discovery and learning algorithms are unable to do. We give empirical results to show that our constrained gradient algorithm is able to discover core tests using very small amounts of data, and with larger amounts of data can compute accurate predictions of the system dynamics. 1
2 0.20321718 37 nips-2005-Benchmarking Non-Parametric Statistical Tests
Author: Mikaela Keller, Samy Bengio, Siew Y. Wong
Abstract: Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets non-parametric tests are relatively reliable in all conditions. 1
3 0.1171568 187 nips-2005-Temporal Abstraction in Temporal-difference Networks
Author: Eddie Rafols, Anna Koop, Richard S. Sutton
Abstract: We present a generalization of temporal-difference networks to include temporally abstract options on the links of the question network. Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment. These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction. The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces. We present empirical examples of our algorithm’s effectiveness and of the greater representational expressiveness of temporallyabstract TD networks. The primary distinguishing feature of temporal-difference (TD) networks (Sutton & Tanner, 2005) is that they permit a general compositional specification of the goals of learning. The goals of learning are thought of as predictive questions being asked by the agent in the learning problem, such as “What will I see if I step forward and look right?” or “If I open the fridge, will I see a bottle of beer?” Seeing a bottle of beer is of course a complicated perceptual act. It might be thought of as obtaining a set of predictions about what would happen if certain reaching and grasping actions were taken, about what would happen if the bottle were opened and turned upside down, and of what the bottle would look like if viewed from various angles. To predict seeing a bottle of beer is thus to make a prediction about a set of other predictions. The target for the overall prediction is a composition in the mathematical sense of the first prediction with each of the other predictions. TD networks are the first framework for representing the goals of predictive learning in a compositional, machine-accessible form. Each node of a TD network represents an individual question—something to be predicted—and has associated with it a value representing an answer to the question—a prediction of that something. The questions are represented by a set of directed links between nodes. If node 1 is linked to node 2, then node 1 rep- resents a question incorporating node 2’s question; its value is a prediction about node 2’s prediction. Higher-level predictions can be composed in several ways from lower ones, producing a powerful, structured representation language for the targets of learning. The compositional structure is not just in a human designer’s head; it is expressed in the links and thus is accessible to the agent and its learning algorithm. The network of these links is referred to as the question network. An entirely separate set of directed links between the nodes is used to compute the values (predictions, answers) associated with each node. These links collectively are referred to as the answer network. The computation in the answer network is compositional in a conventional way—node values are computed from other node values. The essential insight of TD networks is that the notion of compositionality should apply to questions as well as to answers. A secondary distinguishing feature of TD networks is that the predictions (node values) at each moment in time can be used as a representation of the state of the world at that time. In this way they are an instance of the idea of predictive state representations (PSRs) introduced by Littman, Sutton and Singh (2002), Jaeger (2000), and Rivest and Schapire (1987). Representing a state by its predictions is a potentially powerful strategy for state abstraction (Rafols et al., 2005). We note that the questions used in all previous work with PSRs are defined in terms of concrete actions and observations, not other predictions. They are not compositional in the sense that TD-network questions are. The questions we have discussed so far are subjunctive, meaning that they are conditional on a certain way of behaving. We predict what we would see if we were to step forward and look right, or if we were to open the fridge. The questions in conventional TD networks are subjunctive, but they are conditional only on primitive actions or open-loop sequences of primitive actions (as are conventional PSRs). It is natural to generalize this, as we have in the informal examples above, to questions that are conditional on closed-loop temporally extended ways of behaving. For example, opening the fridge is a complex, high-level action. The arm must be lifted to the door, the hand shaped for grasping the handle, etc. To ask questions like “if I were to go to the coffee room, would I see John?” would require substantial temporal abstraction in addition to state abstraction. The options framework (Sutton, Precup & Singh, 1999) is a straightforward way of talking about temporally extended ways of behaving and about predictions of their outcomes. In this paper we extend the options framework so that it can be applied to TD networks. Significant extensions of the original options framework are needed. Novel features of our option-extended TD networks are that they 1) predict components of option outcomes rather than full outcome probability distributions, 2) learn according to the first intra-option method to use eligibility traces (see Sutton & Barto, 1998), and 3) include the possibility of options whose ‘policies’ are indifferent to which of several actions are selected. 1 The options framework In this section we present the essential elements of the options framework (Sutton, Precup & Singh, 1999) that we will need for our extension of TD networks. In this framework, an agent and an environment interact at discrete time steps t = 1, 2, 3.... In each state st ∈ S, the agent selects an action at ∈ A, determining the next state st+1 .1 An action is a way of behaving for one time step; the options framework lets us talk about temporally extended ways of behaving. An individual option consists of three parts. The first is the initiation set, I ⊂ S, the subset of states in which the option can be started. The second component of an option is its policy, π : S × A ⇒ [0, 1], specifying how the agent behaves when 1 Although the options framework includes rewards, we omit them here because we are concerned only with prediction, not control. following the option. Finally, a termination function, β : S × A ⇒ [0, 1], specifies how the option ends: β(s) denotes the probability of terminating when in state s. The option is thus completely and formally defined by the 3-tuple (I, π, β). 2 Conventional TD networks In this section we briefly present the details of the structure and the learning algorithm comprising TD networks as introduced by Sutton and Tanner (2005). TD networks address a prediction problem in which the agent may not have direct access to the state of the environment. Instead, at each time step the agent receives an observation ot ∈ O dependent on the state. The experience stream thus consists of a sequence of alternating actions and observations, o1 , a1 , o2 , a2 , o3 · · ·. The TD network consists of a set of nodes, each representing a single scalar prediction, interlinked by the question and answer networks as suggested previously. For a network 1 n of n nodes, the vector of all predictions at time step t is denoted yt = (yt , . . . , yt )T . The predictions are estimates of the expected value of some scalar quantity, typically of a bit, in which case they can be interpreted as estimates of probabilities. The predictions are updated at each time step according to a vector-valued function u with modifiable parameter W, which is often taken to be of a linear form: yt = u(yt−1 , at−1 , ot , Wt ) = σ(Wt xt ), (1) where xt ∈ m is an m-vector of features created from (yt−1 , at−1 , ot ), Wt is an n × m matrix (whose elements are sometimes referred to as weights), and σ is the n-vector 1 form of either the identity function or the S-shaped logistic function σ(s) = 1+e−s . The feature vector is an arbitrary vector-valued function of yt−1 , at−1 , and ot . For example, in the simplest case the feature vector is a unit basis vector with the location of the one communicating the current state. In a partially observable environment, the feature vector may be a combination of the agent’s action, observations, and predictions from the previous time step. The overall update u defines the answer network. The question network consists of a set of target functions, z i : O × n → , and condition i y functions, ci : A× n → [0, 1]n . We define zt = z i (ot+1 , ˜t+1 ) as the target for prediction i 2 i i yt . Similarly, we define ct = c (at , yt ) as the condition at time t. The learning algorithm ij for each component wt of Wt can then be written ij ij i i wt+1 = wt + α zt − yt ci t i ∂yt , (2) ij ∂wt where α is a positive step-size parameter. Note that the targets here are functions of the observation and predictions exactly one time step later, and that the conditions are functions of a single primitive action. This is what makes this algorithm suitable only for learning about one-step TD relationships. By chaining together multiple nodes, Sutton and Tanner (2005) used it to predict k steps ahead, for various particular values of k, and to predict the outcome of specific action sequences (as in PSRs, e.g., Littman et al., 2002; Singh et al., 2004). Now we consider the extension to temporally abstract actions. 3 Option-extended TD networks In this section we present our intra-option learning algorithm for TD networks with options and eligibility traces. As suggested earlier, each node’s outgoing link in the question 2 The quantity ˜ is almost the same as y, and we encourage the reader to think of them as identical y here. The difference is that ˜ is calculated by weights that are one step out of date as compared to y, y i.e., ˜t = u(yt−1 , at−1 , ot , Wt−1 ) (cf. equation 1). y network will now correspond to an option applying over possibly many steps. The policy of the ith node’s option corresponds to the condition function ci , which we think of as a recognizer for the option. It inspects each action taken to assess whether the option is being followed: ci = 1 if the agent is acting consistently with the option policy and ci = 0 othert t wise (intermediate values are also possible). When an agent ceases to act consistently with the option policy, we say that the option has diverged. The possibility of recognizing more than one action as consistent with the option is a significant generalization of the original idea of options. If no actions are recognized as acceptable in a state, then the option cannot be followed and thus cannot be initiated. Here we take the set of states with at least one recognized action to be the initiation set of the option. The option-termination function β generalizes naturally to TD networks. Each node i is i given a corresponding termination function, β i : O× n → [0, 1], where βt = β i (ot+1 , yt ) i is the probability of terminating at time t.3 βt = 1 indicates that the option has terminated i at time t; βt = 0 indicates that it has not, and intermediate values of β correspond to soft i or stochastic termination conditions. If an option terminates, then zt acts as the target, but if the option is ongoing without termination, then the node’s own next value, yt+1 , should ˜i be the target. The termination function specifies which of the two targets (or mixture of the two targets) is used to produce a form of TD error for each node i: i i i i i i δt = βt zt + (1 − βt )˜t+1 − yt . y (3) Our option-extended algorithm incorporates eligibility traces (see Sutton & Barto, 1998) as short-term memory variables organized in an n × m matrix E, paralleling the weight matrix. The traces are a record of the effect that each weight could have had on each node’s prediction during the time the agent has been acting consistently with the node’s option. The components eij of the eligibility matrix are updated by i eij = ci λeij (1 − βt ) + t t t−1 i ∂yt ij ∂wt , (4) where 0 ≤ λ ≤ 1 is the trace-decay parameter familiar from the TD(λ) learning algorithm. Because of the ci factor, all of a node’s traces will be immediately reset to zero whenever t the agent deviates from the node’s option’s policy. If the agent follows the policy and the option does not terminate, then the trace decays by λ and increments by the gradient in the way typical of eligibility traces. If the policy is followed and the option does terminate, then the trace will be reset to zero on the immediately following time step, and a new trace will start building. Finally, our algorithm updates the weights on each time step by ij ij i wt+1 = wt + α δt eij . t 4 (5) Fully observable experiment This experiment was designed to test the correctness of the algorithm in a simple gridworld where the environmental state is observable. We applied an options-extended TD network to the problem of learning to predict observations from interaction with the gridworld environment shown on the left in Figure 1. Empty squares indicate spaces where the agent can move freely, and colored squares (shown shaded in the figure) indicate walls. The agent is egocentric. At each time step the agent receives from the environment six bits representing the color it is facing (red, green, blue, orange, yellow, or white). In this first experiment we also provided 6 × 6 × 4 = 144 other bits directly indicating the complete state of the environment (square and orientation). 3 The fact that the option depends only on the current predictions, action, and observation means that we are considering only Markov options. Figure 1: The test world (left) and the question network (right) used in the experiments. The triangle in the world indicates the location and orientation of the agent. The walls are labeled R, O, Y, G, and B representing the colors red, orange, yellow, green and blue. Note that the left wall is mostly blue but partly green. The right diagram shows in full the portion of the question network corresponding to the red bit. This structure is repeated, but not shown, for the other four (non-white) colors. L, R, and F are primitive actions, and Forward and Wander are options. There are three possible actions: A ={F, R, L}. Actions were selected according to a fixed stochastic policy independent of the state. The probability of the F, L, and R actions were 0.5, 0.25, and 0.25 respectively. L and R cause the agent to rotate 90 degrees to the left or right. F causes the agent to move ahead one square with probability 1 − p and to stay in the same square with probability p. The probability p is called the slipping probability. If the forward movement would cause the agent to move into a wall, then the agent does not move. In this experiment, we used p = 0, p = 0.1, and p = 0.5. In addition to these primitive actions, we provided two temporally abstract options, Forward and Wander. The Forward option takes the action F in every state and terminates when the agent senses a wall (color) in front of it. The policy of the Wander option is the same as that actually followed by the agent. Wander terminates with probability 1 when a wall is sensed, and spontaneously with probability 0.5 otherwise. We used the question network shown on the right in Figure 1. The predictions of nodes 1, 2, and 3 are estimates of the probability that the red bit would be observed if the corresponding primitive action were taken. Node 4 is a prediction of whether the agent will see the red bit upon termination of the Wander option if it were taken. Node 5 predicts the probability of observing the red bit given that the Forward option is followed until termination. Nodes 6 and 7 represent predictions of the outcome of a primitive action followed by the Forward option. Nodes 8 and 9 take this one step further: they represent predictions of the red bit if the Forward option were followed to termination, then a primitive action were taken, and then the Forward option were followed again to termination. We applied our algorithm to learn the parameter W of the answer network for this question network. The step-size parameter α was 1.0, and the trace-decay parameter λ was 0.9. The initial W0 , E0 , and y0 were all 0. Each run began with the agent in the state indicated in Figure 1 (left). In this experiment σ(·) was the identity function. For each value of p, we ran 50 runs of 20,000 time steps. On each time step, the root-meansquared (RMS) error in each node’s prediction was computed and then averaged over all the nodes. The nodes corresponding to the Wander option were not included in the average because of the difficulty of calculating their correct predictions. This average was then 0.4 Fully Observable 0.4 RMS Error RMS Error p=0 0 0 Partially Observable p = 0.1 5000 p = 0.5 10000 15000 20000 Steps 0 0 100000 200000 Steps 300000 Figure 2: Learning curves in the fully-observable experiment for each slippage probability (left) and in the partially-observable experiment (right). itself averaged over the 50 runs and bins of 1,000 time steps to produce the learning curves shown on the left in Figure 2. For all slippage probabilities, the error in all predictions fell almost to zero. After approximately 12,000 trials, the agent made almost perfect predictions in all cases. Not surprisingly, learning was slower at the higher slippage probabilities. These results show that our augmented TD network is able to make a complete temporally-abstract model of this world. 5 Partially observable experiment In our second experiment, only the six color observation bits were available to the agent. This experiment provides a more challenging test of our algorithm. To model the environment well, the TD network must construct a representation of state from very sparse information. In fact, completely accurate prediction is not possible in this problem with our question network. In this experiment the input vector consisted of three groups of 46 components each, 138 in total. If the action was R, the first 46 components were set to the 40 node values and the six observation bits, and the other components were 0. If the action was L, the next group of 46 components was filled in in the same way, and the first and third groups were zero. If the action was F, the third group was filled. This technique enables the answer network as function approximator to represent a wider class of functions in a linear form than would otherwise be possible. In this experiment, σ(·) was the S-shaped logistic function. The slippage probability was p = 0.1. As our performance measure we used the RMS error, as in the first experiment, except that the predictions for the primitive actions (nodes 1-3) were not included. These predictions can never become completely accurate because the agent can’t tell in detail where it is located in the open space. As before, we averaged RMS error over 50 runs and 1,000 time step bins, to produce the learning curve shown on the right in Figure 2. As before, the RMS error approached zero. Node 5 in Figure 1 holds the prediction of red if the agent were to march forward to the wall ahead of it. Corresponding nodes in the other subnetworks hold the predictions of the other colors upon Forward. To make these predictions accurately, the agent must keep track of which wall it is facing, even if it is many steps away from it. It has to learn a sort of compass that it can keep updated as it turns in the middle of the space. Figure 3 is a demonstration of the compass learned after a representative run of 200,000 time steps. At the end of the run, the agent was driven manually to the state shown in the first row (relative time index t = 1). On steps 1-25 the agent was spun clockwise in place. The third column shows the prediction for node 5 in each portion of the question network. That is, the predictions shown are for each color-observation bit at termination of the Forward option. At t = 1, the agent is facing the orange wall and it predicts that the Forward option would result in seeing the orange bit and none other. Over steps 2-5 we see that the predictions are maintained accurately as the agent spins despite the fact that its observation bits remain the same. Even after spinning for 25 steps the agent knows exactly which way it is facing. While spinning, the agent correctly never predicts seeing the green bit (after Forward), but if it is driven up and turned, as in the last row of the figure, the green bit is accurately predicted. The fourth column shows the prediction for node 8 in each portion of the question network. Recall that these nodes correspond to the sequence Forward, L, Forward. At time t = 1, the agent accurately predicts that Forward will bring it to orange (third column) and also predicts that Forward, L, Forward will bring it to green. The predictions made for node 8 at each subsequent step of the sequence are also correct. These results show that the agent is able to accurately maintain its long term predictions without directly encountering sensory verification. How much larger would the TD network have to be to handle a 100x100 gridworld? The answer is not at all. The same question network applies to any size problem. If the layout of the colored walls remain the same, then even the answer network transfers across worlds of widely varying sizes. In other experiments, training on successively larger problems, we have shown that the same TD network as used here can learn to make all the long-term predictions correctly on a 100x100 version of the 6x6 gridworld used here. t y5 t st y8 t 1 1 O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG 2 3 4 5 25 29 Figure 3: An illustration of part of what the agent learns in the partially observable environment. The second column is a sequence of states with (relative) time index as given by the first column. The sequence was generated by controlling the agent manually. On steps 1-25 the agent was spun clockwise in place, and the trajectory after that is shown by the line in the last state diagram. The third and fourth columns show the values of the nodes corresponding to 5 and 8 in Figure 1, one for each color-observation bit. 6 Conclusion Our experiments show that option-extended TD networks can learn effectively. They can learn facts about their environments that are not representable in conventional TD networks or in any other method for learning models of the world. One concern is that our intra-option learning algorithm is an off-policy learning method incorporating function approximation and bootstrapping (learning from predictions). The combination of these three is known to produce convergence problems for some methods (see Sutton & Barto, 1998), and they may arise here. A sound solution may require modifications to incorporate importance sampling (see Precup, Sutton & Dasgupta, 2001). In this paper we have considered only intra-option eligibility traces—traces extending over the time span within an option but not persisting across options. Tanner and Sutton (2005) have proposed a method for inter-option traces that could perhaps be combined with our intra-option traces. The primary contribution of this paper is the introduction of a new learning algorithm for TD networks that incorporates options and eligibility traces. Our experiments are small and do little more than exercise the learning algorithm, showing that it does not break immediately. More significant is the greater representational power of option-extended TD networks. Options are a general framework for temporal abstraction, predictive state representations are a promising strategy for state abstraction, and TD networks are able to represent compositional questions. The combination of these three is potentially very powerful and worthy of further study. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Mark Ring, Brian Tanner, Satinder Singh, Doina Precup, and all the members of the rlai.net group. References Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398. MIT Press. Littman, M., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. In T. G. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances In Neural Information Processing Systems 14, pp. 1555-1561. MIT Press. Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In C. E. Brodley, A. P. Danyluk (eds.), Proceedings of the Eighteenth International Conference on Machine Learning, pp. 417-424. San Francisco, CA: Morgan Kaufmann. Rafols, E. J., Ring, M., Sutton, R.S., & Tanner, B. (2005). Using predictive representations to improve generalization in reinforcement learning. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence. Rivest, R. L., & Schapire, R. E. (1987). Diversity-based inference of finite automata. In Proceedings of the Twenty Eighth Annual Symposium on Foundations of Computer Science, (pp. 78–87). IEEE Computer Society. Singh, S., James, M. R., & Rudary, M. R. (2004). Predictive state representations: A new theory for modeling dynamical systems. In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference in Uncertainty in Artificial Intelligence, (pp. 512–519). AUAI Press. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, pp. 181-211. Sutton, R. S., & Tanner, B. (2005). Conference 17. Temporal-difference networks. To appear in Neural Information Processing Systems Tanner, B., Sutton, R. S. (2005) Temporal-difference networks with history. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence.
4 0.098692954 50 nips-2005-Convex Neural Networks
Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1
5 0.089667976 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversions techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing datadependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
6 0.079471976 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models
7 0.0767859 153 nips-2005-Policy-Gradient Methods for Planning
8 0.076529749 76 nips-2005-From Batch to Transductive Online Learning
9 0.072346836 36 nips-2005-Bayesian models of human action understanding
10 0.069752745 144 nips-2005-Off-policy Learning with Options and Recognizers
11 0.067450434 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
12 0.065044723 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
13 0.063720793 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
14 0.059381686 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
15 0.057731509 78 nips-2005-From Weighted Classification to Policy Search
16 0.056147799 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
17 0.052802116 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning
18 0.048716489 103 nips-2005-Kernels for gene regulatory regions
19 0.04850227 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
20 0.044991031 41 nips-2005-Coarse sample complexity bounds for active learning
topicId topicWeight
[(0, 0.16), (1, 0.001), (2, 0.139), (3, -0.052), (4, 0.046), (5, 0.029), (6, 0.055), (7, 0.022), (8, 0.007), (9, 0.073), (10, -0.101), (11, 0.105), (12, 0.023), (13, -0.043), (14, 0.062), (15, -0.024), (16, -0.067), (17, 0.115), (18, -0.052), (19, 0.086), (20, -0.16), (21, 0.084), (22, -0.037), (23, -0.046), (24, 0.031), (25, -0.065), (26, 0.105), (27, -0.005), (28, 0.265), (29, 0.248), (30, 0.167), (31, -0.256), (32, 0.01), (33, -0.05), (34, -0.075), (35, -0.07), (36, -0.005), (37, -0.044), (38, -0.052), (39, -0.044), (40, -0.065), (41, 0.036), (42, -0.206), (43, -0.015), (44, -0.106), (45, -0.107), (46, 0.014), (47, -0.098), (48, 0.028), (49, 0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.97174066 148 nips-2005-Online Discovery and Learning of Predictive State Representations
Author: Peter Mccracken, Michael Bowling
Abstract: Predictive state representations (PSRs) are a method of modeling dynamical systems using only observable data, such as actions and observations, to describe their model. PSRs use predictions about the outcome of future tests to summarize the system state. The best existing techniques for discovery and learning of PSRs use a Monte Carlo approach to explicitly estimate these outcome probabilities. In this paper, we present a new algorithm for discovery and learning of PSRs that uses a gradient descent approach to compute the predictions for the current state. The algorithm takes advantage of the large amount of structure inherent in a valid prediction matrix to constrain its predictions. Furthermore, the algorithm can be used online by an agent to constantly improve its prediction quality; something that current state of the art discovery and learning algorithms are unable to do. We give empirical results to show that our constrained gradient algorithm is able to discover core tests using very small amounts of data, and with larger amounts of data can compute accurate predictions of the system dynamics. 1
2 0.75347519 37 nips-2005-Benchmarking Non-Parametric Statistical Tests
Author: Mikaela Keller, Samy Bengio, Siew Y. Wong
Abstract: Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets non-parametric tests are relatively reliable in all conditions. 1
3 0.45699194 62 nips-2005-Efficient Estimation of OOMs
Author: Herbert Jaeger, Mingjie Zhao, Andreas Kolling
Abstract: A standard method to obtain stochastic models for symbolic time series is to train state-emitting hidden Markov models (SE-HMMs) with the Baum-Welch algorithm. Based on observable operator models (OOMs), in the last few months a number of novel learning algorithms for similar purposes have been developed: (1,2) two versions of an ”efficiency sharpening” (ES) algorithm, which iteratively improves the statistical efficiency of a sequence of OOM estimators, (3) a constrained gradient descent ML estimator for transition-emitting HMMs (TE-HMMs). We give an overview on these algorithms and compare them with SE-HMM/EM learning on synthetic and real-life data. 1
4 0.44706854 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversions techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing datadependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
5 0.44119993 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
Author: Alan L. Yuille
Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1
6 0.38001657 187 nips-2005-Temporal Abstraction in Temporal-difference Networks
7 0.37308145 76 nips-2005-From Batch to Transductive Online Learning
8 0.33958337 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
9 0.3305237 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
10 0.32685542 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
11 0.3027007 36 nips-2005-Bayesian models of human action understanding
12 0.28561285 50 nips-2005-Convex Neural Networks
13 0.27205518 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
14 0.27165633 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
15 0.26631343 153 nips-2005-Policy-Gradient Methods for Planning
16 0.2571899 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
17 0.24638173 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation
18 0.2462042 35 nips-2005-Bayesian model learning in human visual perception
19 0.24543221 103 nips-2005-Kernels for gene regulatory regions
20 0.24191065 144 nips-2005-Off-policy Learning with Options and Recognizers
topicId topicWeight
[(3, 0.041), (10, 0.042), (19, 0.292), (26, 0.011), (27, 0.047), (31, 0.064), (34, 0.089), (44, 0.011), (55, 0.043), (69, 0.056), (73, 0.037), (88, 0.08), (91, 0.073)]
simIndex simValue paperId paperTitle
same-paper 1 0.79480743 148 nips-2005-Online Discovery and Learning of Predictive State Representations
Author: Peter Mccracken, Michael Bowling
Abstract: Predictive state representations (PSRs) are a method of modeling dynamical systems using only observable data, such as actions and observations, to describe their model. PSRs use predictions about the outcome of future tests to summarize the system state. The best existing techniques for discovery and learning of PSRs use a Monte Carlo approach to explicitly estimate these outcome probabilities. In this paper, we present a new algorithm for discovery and learning of PSRs that uses a gradient descent approach to compute the predictions for the current state. The algorithm takes advantage of the large amount of structure inherent in a valid prediction matrix to constrain its predictions. Furthermore, the algorithm can be used online by an agent to constantly improve its prediction quality; something that current state of the art discovery and learning algorithms are unable to do. We give empirical results to show that our constrained gradient algorithm is able to discover core tests using very small amounts of data, and with larger amounts of data can compute accurate predictions of the system dynamics. 1
2 0.51536673 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
Author: O. P. Kreidl, Alan S. Willsky
Abstract: Given a directed graphical model with binary-valued hidden nodes and real-valued noisy observations, consider deciding upon the maximum a-posteriori (MAP) or the maximum posterior-marginal (MPM) assignment under the restriction that each node broadcasts only to its children exactly one single-bit message. We present a variational formulation, viewing the processing rules local to all nodes as degrees-of-freedom, that minimizes the loss in expected (MAP or MPM) performance subject to such online communication constraints. The approach leads to a novel message-passing algorithm to be executed offline, or before observations are realized, which mitigates the performance loss by iteratively coupling all rules in a manner implicitly driven by global statistics. We also provide (i) illustrative examples, (ii) assumptions that guarantee convergence and efficiency and (iii) connections to active research areas. 1
3 0.51088673 78 nips-2005-From Weighted Classification to Policy Search
Author: Doron Blatt, Alfred O. Hero
Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1
4 0.50925905 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
Author: Firas Hamze, Nando de Freitas
Abstract: This paper presents a new sampling algorithm for approximating functions of variables representable as undirected graphical models of arbitrary connectivity with pairwise potentials, as well as for estimating the notoriously difficult partition function of the graph. The algorithm fits into the framework of sequential Monte Carlo methods rather than the more widely used MCMC, and relies on constructing a sequence of intermediate distributions which get closer to the desired one. While the idea of using “tempered” proposals is known, we construct a novel sequence of target distributions where, rather than dropping a global temperature parameter, we sequentially couple individual pairs of variables that are, initially, sampled exactly from a spanning tree of the variables. We present experimental results on inference and estimation of the partition function for sparse and densely-connected graphs.
5 0.50560963 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
Author: Alan L. Yuille
Abstract: We show that linear generalizations of Rescorla-Wagner can perform Maximum Likelihood estimation of the parameters of all generative models for causal reasoning. Our approach involves augmenting variables to deal with conjunctions of causes, similar to the agumented model of Rescorla. Our results involve genericity assumptions on the distributions of causes. If these assumptions are violated, for example for the Cheng causal power theory, then we show that a linear Rescorla-Wagner can estimate the parameters of the model up to a nonlinear transformtion. Moreover, a nonlinear Rescorla-Wagner is able to estimate the parameters directly to within arbitrary accuracy. Previous results can be used to determine convergence and to estimate convergence rates. 1
6 0.50542885 144 nips-2005-Off-policy Learning with Options and Recognizers
7 0.5052368 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
8 0.50523174 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
9 0.50224102 30 nips-2005-Assessing Approximations for Gaussian Process Classification
10 0.49937719 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks
11 0.49667504 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
12 0.49290645 23 nips-2005-An Application of Markov Random Fields to Range Sensing
13 0.49279395 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
14 0.49228394 184 nips-2005-Structured Prediction via the Extragradient Method
15 0.49213922 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
16 0.49182326 105 nips-2005-Large-Scale Multiclass Transduction
17 0.49102846 133 nips-2005-Nested sampling for Potts models
18 0.49075943 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
19 0.48891723 46 nips-2005-Consensus Propagation
20 0.48859119 102 nips-2005-Kernelized Infomax Clustering