nips nips2009 nips2009-159 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári
Abstract: In this paper we introduce a multi-step linear Dyna-style planning algorithm. The key element of the multi-step linear Dyna is a multi-step linear model that enables multi-step projection of a sampled feature and multi-step planning based on the simulated multi-step transition experience. We propose two multi-step linear models. The first iterates the one-step linear model, but is generally computationally complex. The second interpolates between the one-step model and the infinite-step model (which turns out to be the LSTD solution), and can be learned efficiently online. Policy evaluation on Boyan Chain shows that multi-step linear Dyna learns a policy faster than single-step linear Dyna, and generally learns faster as the number of projection steps increases. Results on Mountain-car show that multi-step linear Dyna leads to much better online performance than single-step linear Dyna and model-free algorithms; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. Our results also suggest that previous attempts on extending LSTD for online control were unsuccessful because LSTD looks infinite steps into the future, and suffers from the model errors in non-stationary (control) environments.
Reference: text
sentIndex sentText sentNum sentScore
1 The key element of the multi-step linear Dyna is a multi-step linear model that enables multi-step projection of a sampled feature and multi-step planning based on the simulated multi-step transition experience. [sent-2, score-0.386]
2 Policy evaluation on Boyan Chain shows that multi-step linear Dyna learns a policy faster than single-step linear Dyna, and generally learns faster as the number of projection steps increases. [sent-6, score-0.37]
3 Results on Mountain-car show that multi-step linear Dyna leads to much better online performance than single-step linear Dyna and model-free algorithms; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. [sent-7, score-0.23]
4 Our results also suggest that previous attempts on extending LSTD for online control were unsuccessful because LSTD looks infinite steps into the future, and suffers from the model errors in non-stationary (control) environments. [sent-8, score-0.157]
5 1 Introduction Linear Dyna-style planning extends Dyna to linear function approximation (Sutton, Szepesvári, Geramifard & Bowling, 2008), and can be used in large-scale applications. [sent-9, score-0.251]
6 However, existing Dyna and linear Dyna-style planning algorithms are all single-step, because they only simulate sampled features one step ahead. [sent-10, score-0.26]
7 We extend the linear Dyna architecture by using a multi-step linear model of the world, which gives what we call multi-step linear Dyna-style planning. [sent-12, score-0.185]
8 Multi-step linear Dyna-style planning is more advantageous than existing linear Dyna, because a multi-step model of the world can project a feature multiple steps into the future and thus provide predictions further ahead from that feature. [sent-13, score-0.469]
9 For policy evaluation we introduce two multi-step linear models. [sent-14, score-0.202]
10 The first is generated by iterating the one-step linear model, but is computationally complex when the number of features is large. [sent-15, score-0.074]
11 The second, which we call the λ-model, interpolates between the one-step linear model and an infinite-step linear model of the world, and can be computed efficiently online. [sent-16, score-0.163]
12 Our multi-step linear Dyna-style planning for policy evaluation, Dyna(k), uses the multi-step linear models to generate k-steps-ahead prediction of the sampled feature, and applies a generalized TD (temporal difference, e. [sent-17, score-0.434]
13 For the problem of control, related work includes least-squares policy iteration (LSPI) (Lagoudakis & Parr, 2001; Lagoudakis & Parr, 2003; Li, Littman & Mansley, 2009), and linear Dyna-style planning for control. [sent-21, score-0.388]
14 LSPI is an offline algorithm that learns a greedy policy from a data set of experience through a number of iterations, each of which sweeps the data set and alternates between LSTD and policy improvement. [sent-22, score-0.313]
15 Sutton et al. (2008) explored the use of linear function approximation with Dyna for control, which does planning using a set of linear action models built from state to state. [sent-24, score-0.348]
16 In this paper, we first build a one-step model from state-action pair to state-action pair through tracking the greedy policy. [sent-25, score-0.12]
17 Using this tracking model for planning is in fact another way of doing single-step linear Dyna-style planning. [sent-26, score-0.301]
18 In a similar manner to policy evaluation, we also have two multi-step models for control. [sent-27, score-0.146]
19 We build the iterated multi-step model by iterating the one-step tracking model. [sent-28, score-0.158]
20 Also, we build a λ-model for control by interpolating the one-step tracking model and the infinite-step model (also built through tracking). [sent-29, score-0.156]
21 As the infinite-step model coincides with the LSTD solution, we actually propose an online LSTD control algorithm. [sent-30, score-0.12]
22 Policy evaluation on Boyan Chain shows that multi-step linear Dyna learns a policy faster than single-step linear Dyna. [sent-31, score-0.281]
23 Results on the Mountain-car experiment show that multi-step linear Dyna can find the optimal policy faster than single-step linear Dyna; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. [sent-32, score-0.351]
24 In fact, LSTD control and the infinite-step linear Dyna for control are both unstable, and some intermediate value of k makes the k-step linear Dyna for control perform the best. [sent-33, score-0.314]
25 , N}, the problem of policy evaluation is to predict the long-term reward of a policy π for every state s ∈ S: V^π(s) = Σ_{t=0}^∞ γ^t r_t, 0 < γ < 1, s_0 = s, where r_t is the reward received by the agent at time t. [sent-37, score-0.468]
26 At time t, linear TD(0) updates the weights as θ_{t+1} = θ_t + α_t δ_t φ_t, δ_t = r_t + γ θ_t^T φ_{t+1} − θ_t^T φ_t, where α_t is a positive step-size and φ_t corresponds to φ(s_t). [sent-52, score-0.089]
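To make the quoted update concrete, here is a minimal NumPy sketch of one linear TD(0) step; the function name and the default step-size and discount values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def td0_update(theta, phi_t, r_t, phi_next, alpha=0.1, gamma=0.95):
    """One linear TD(0) step: theta <- theta + alpha * delta_t * phi_t."""
    delta_t = r_t + gamma * theta.dot(phi_next) - theta.dot(phi_t)  # TD error
    return theta + alpha * delta_t * phi_t
```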
27 Modern Dyna is more advantageous when used with linear function approximation, which is called linear Dyna-style planning (Sutton et al. [sent-54, score-0.323]
28 We denote the state transition probability matrix of policy π by P^π, whose (i, j)-th component is P^π_{i,j} = E_π{s_{t+1} = j | s_t = i}; and denote the expected reward vector of policy π by R^π, whose i-th component is the expected reward of leaving state i in one step. [sent-56, score-0.402]
29 Linear Dyna tries to estimate a compressed model of policy π: (F^π)^T = (Φ^T D^π Φ)^{−1} · Φ^T D^π P^π Φ; f^π = (Φ^T D^π Φ)^{−1} · Φ^T D^π R^π, where D^π is the N × N matrix whose diagonal entries correspond to the steady distribution of states under policy π. [sent-57, score-0.29]
30 Dyna repeats a number of planning steps, in each of which it samples a feature, projects it using the world model, and plans using linear TD(0) based on the imaginary experience. [sent-59, score-0.341]
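The sample-project-plan loop just described can be sketched as follows. This is a minimal illustration assuming the one-step model (F, f) is learned by incremental least-mean-square updates in the style of Sutton et al. (2008), and that the discount is applied in the planning update rather than folded into F; function and variable names are assumptions of this sketch.

```python
import numpy as np

def model_update(F, f, phi, r, phi_next, beta=0.1):
    """Incrementally update the one-step linear world model (F, f)."""
    F = F + beta * np.outer(phi_next - F.dot(phi), phi)
    f = f + beta * (r - f.dot(phi)) * phi
    return F, f

def dyna_planning_step(theta, F, f, phi_sim, alpha=0.1, gamma=0.95):
    """One planning step: project a sampled feature with the model and apply
    TD(0) to the imaginary one-step experience (phi_sim, r_proj, phi_proj)."""
    phi_proj = F.dot(phi_sim)   # expected next feature
    r_proj = f.dot(phi_sim)     # expected one-step reward
    delta = r_proj + gamma * theta.dot(phi_proj) - theta.dot(phi_sim)
    return theta + alpha * delta * phi_sim
```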
31 For policy evaluation, the fixed-point of linear Dyna is the same as that of linear TD(0) under some assumptions (Tsitsiklis & Van Roy, 1997; Sutton et al. [sent-60, score-0.244]
32 3 The Multi-step Linear Model In the lookup table representation, (P^π)^T and R^π constitute the one-step world model. [sent-62, score-0.062]
33 The k-step transition model of the world is obtained by iterating (P^π)^T k times with discount (Sutton, 1995): P^(k) = (γ(P^π)^T)^k, ∀k = 1, 2, . [sent-63, score-0.086]
34 P^(k) and R^(k) predict a feature k steps into the future. [sent-70, score-0.064]
35 In particular, P^(k) φ is the feature of the expected state after k steps from φ, and (R^(k))^T φ is the expected accumulated reward over k steps from φ. [sent-71, score-0.15]
36 1 The Iterated Multi-step Linear Model In the linear function approximation, F^π and f^π constitute the one-step linear model. [sent-77, score-0.106]
37 Similar to the lookup table representation, we can iterate F^π k times and accumulate the approximated rewards along the way: F^(k) = (γF^π)^k; f^(k) = Σ_{j=0}^{k−1} (γ(F^π)^T)^j f^π. [sent-78, score-0.098]
38 We call (F^(k), f^(k)) the iterated multi-step linear model. [sent-79, score-0.122]
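A direct (and, as noted above, computationally heavy) way to form the iterated model from a learned one-step model might look like the sketch below; the matrix-power implementation is an assumption of this illustration, chosen for clarity rather than efficiency.

```python
import numpy as np

def iterated_model(F, f, k, gamma=0.95):
    """Iterated k-step model: F_k = (gamma*F)^k, f_k = sum_{j<k} (gamma*F^T)^j f."""
    G = gamma * F
    F_k = np.linalg.matrix_power(G, k)
    f_k = sum(np.linalg.matrix_power(G.T, j).dot(f) for j in range(k))
    return F_k, f_k
```

Each matrix power costs on the order of n^3 in the number of features n, which is why this model becomes impractical when the feature dimension is large.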
39 Given an imaginary feature φ̃_τ, we look k steps ahead to see our future feature by applying F^(k): φ̃^(k) = F^(k) φ̃_τ. [sent-91, score-0.167]
40 This means that the more steps we look into the future from a given feature, the more ambiguous our resulting feature becomes. [sent-93, score-0.064]
41 We can use a decayed one-step linear model to approximate the effects of looking multiple steps into the future: L^(k) = (λγ)^{k−1} γF^π, parameterized by a factor λ ∈ (0, 1]. [sent-98, score-0.118]
42 When k = 1, we have L^(1) = F^(1) = γF^π and l^(1) = f^(1) = f^π, recovering the one-step model used by existing linear Dyna. [sent-101, score-0.062]
43 Notice that L^(k) diminishes as k grows, which is consistent with the fact that F^(k) also diminishes as k grows. [sent-102, score-0.074]
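The feature part of the λ-model is explicit in the text above and can be written down directly; the reward part l^(k) (which interpolates between f^(1) and the infinite-step f^(∞)) is not reproduced in this excerpt, so the sketch below covers only L^(k).

```python
def lambda_model_matrix(F, k, lam=0.9, gamma=0.95):
    """Feature part of the lambda-model: L_k = (lam*gamma)^(k-1) * gamma * F."""
    return (lam * gamma) ** (k - 1) * gamma * F
```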
44 4 Multi-step Linear Dyna-style Planning for Policy Evaluation The architecture of multi-step linear Dyna-style planning, Dyna(k), is shown in Algorithm 1. [sent-111, score-0.059]
45 For example, in the algorithm we can take M^(k) = F^(k) and m^(k) = f^(k), giving us a linear Dyna architecture using the iterated multi-step linear model, which we call the Dyna(k)-iterate. [sent-113, score-0.181]
46 In the following we present the family of Dyna(k) planning algorithms that use the λ-model. [sent-114, score-0.214]
47 We first develop a planning algorithm for the infinite-step model, and based on it we then present Dyna(k) planning using the λ-model for any finite k. [sent-115, score-0.41]
48 1 Dyna(∞): Planning using the Infinite-step Model The infinite-step model is preferable in computation because F^(∞) diminishes and the model reduces to f^(∞). [sent-117, score-0.069]
49 We can accumulate A^π and b^π online like LSTD (Bradtke & Barto, 1996; Boyan, 1999; Xu et al. [sent-119, score-0.074]
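A standard way to accumulate these statistics online is sketched below; the class name and the small ridge term added for invertibility are assumptions of this illustration, and the exact mapping from (A^π, b^π) to f^(∞) used in the paper is not reproduced in this excerpt.

```python
import numpy as np

class LSTDAccumulator:
    """Accumulate A = sum_t phi_t (phi_t - gamma*phi_{t+1})^T and b = sum_t r_t phi_t."""
    def __init__(self, n_features, gamma=0.95, eps=1e-3):
        self.A = eps * np.eye(n_features)  # small ridge term keeps A invertible
        self.b = np.zeros(n_features)
        self.gamma = gamma

    def update(self, phi, r, phi_next):
        self.A += np.outer(phi, phi - self.gamma * phi_next)
        self.b += r * phi

    def solve(self):
        return np.linalg.solve(self.A, self.b)  # LSTD fixed-point weights
```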
50 Algorithm 1 Dyna(k) algorithm for evaluating policy π (using any valid k-step model). [sent-128, score-0.137]
51 Given an imaginary feature φ̃, we look k steps ahead to see the future features and rewards: φ̃^(k) = L^(k) φ̃; r̃^(k) = (l^(k))^T φ̃. [sent-131, score-0.113]
52 Thus we obtain an imaginary k-step transition experience φ̃ → (φ̃^(k), r̃^(k)), on which we apply a k-step version of linear TD(0): θ̃_{τ+1} = θ̃_τ + α(r̃^(k) + θ̃_τ^T φ̃^(k) − θ̃_τ^T φ̃) φ̃. [sent-132, score-0.102]
53 We call this algorithm the Dyna(k)-lambda planning algorithm. [sent-133, score-0.223]
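Putting the projection and the k-step TD(0)-style update together, one planning step of Dyna(k)-lambda might be sketched as follows; note that the discount is already folded into L^(k), so no explicit γ appears in the TD error. Names and defaults are assumptions of this sketch.

```python
def dyna_k_lambda_planning_step(theta, L_k, l_k, phi_sim, alpha=0.1):
    """One Dyna(k)-lambda planning step on an imaginary k-step transition."""
    phi_k = L_k.dot(phi_sim)   # k-steps-ahead feature
    r_k = l_k.dot(phi_sim)     # predicted accumulated k-step reward
    delta = r_k + theta.dot(phi_k) - theta.dot(phi_sim)  # discount lives in L_k
    return theta + alpha * delta * phi_sim
```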
54 5 Planning for Control Planning for control is more difficult than that for policy evaluation because in control the policy changes from time step to time step. [sent-137, score-0.429]
55 Linear Dyna uses a separate model for each action, and these action models are from state to state (Sutton et al. [sent-138, score-0.105]
56 Our model for control is different in that it is from state-action pair to state-action pair. [sent-140, score-0.095]
57 However, rather than building a model for all state-action pairs, we build only one state-action model that tracks the sequence of greedy actions. [sent-141, score-0.102]
58 Using this greedy-tracking model is another way of doing linear Dyna-style planning. [sent-142, score-0.062]
59 In the following we first build the single-step greedy-tracking model and the infinite-step greedy-tracking model, and based on these tracking models we build the iterated model and the λ-model. [sent-143, score-0.168]
60 Our extension of linear Dyna to control contains a TD control step (we use Q-learning), and we call it the linear Dyna-Q architecture. [sent-144, score-0.246]
61 We build a single-step projection matrix between state-action pairs, F, by moving its projection of the current feature towards the greedy next state-action feature (tracking): F_{t+1} = F_t + β_t (φ_{t+1} − F_t φ_t) φ_t^T. (5) [sent-148, score-0.14]
62 Algorithm 2 Dyna-Q(k)-lambda: k-step linear Dyna-Q algorithm for control (using the λ-model). [sent-149, score-0.114]
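The tracking update in Eq. (5) can be sketched as below. The excerpt shows only the update of F; extending the same update to the reward part f is an assumption of this illustration, as are the function and variable names.

```python
import numpy as np

def tracking_model_update(F, f, phi_sa, r, phi_sa_greedy_next, beta=0.1):
    """Track the greedy policy: move F's projection of the current state-action
    feature towards the greedy next state-action feature (Eq. 5)."""
    F = F + beta * np.outer(phi_sa_greedy_next - F.dot(phi_sa), phi_sa)
    f = f + beta * (r - f.dot(phi_sa)) * phi_sa  # reward part: assumed analogous update
    return F, f
```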
63 Once the one-step model and the infinite-step model are available, we interpolate them and compute the λ-model in a similar manner to policy evaluation. [sent-153, score-0.178]
64 The complete multi-step Dyna-Q control algorithm using the λ-model is shown in Algorithm 2. [sent-154, score-0.068]
65 We noticed that f^(∞) can be directly used for control, giving an online LSTD control algorithm. [sent-155, score-0.104]
66 We can also extend the iterated multi-step model and Dyna(k)-iterate to control. [sent-156, score-0.074]
67 Given the single-step greedy-tracking model, we can iterate it and get the iterated multi-step linear model in a similar way to policy evaluation. [sent-157, score-0.284]
68 The linear Dyna for control using the iterated greedy-tracking model (which we call Dyna-Q(k)-iterate) is straightforward and thus not shown. [sent-158, score-0.206]
69 Previously it was shown that linear Dyna can learn a policy faster than model-free TD methods in the beginning episodes (Sutton et al. [sent-162, score-0.246]
70 However, after some episodes, their implementation of linear Dyna performed worse than TD. [sent-164, score-0.061]
71 A possible reason for their results is that the step-sizes for learning, modeling, and planning were set to the same value. [sent-165, score-0.215]
72 The weights of the various learning algorithms, f^π for linear Dyna, and b^π for Dyna(k) were all initialized to zero. [sent-186, score-0.059]
73 In the planning step, all Dyna algorithms sampled a unit basis vector whose nonzero component was in a uniformly random location. [sent-188, score-0.214]
74 In the following we report results obtained with a single planning step per time step. [sent-189, score-0.205]
75 All linear Dyna algorithms were found to be significantly and consistently faster than TD. [sent-192, score-0.075]
76 Furthermore, multi-step linear Dyna algorithms were much faster than single-step linear Dyna algorithms. [sent-193, score-0.121]
77 2 Mountain-car We used the same Mountain-car environment and tile coding as in the linear Dyna paper (Sutton et al. [sent-206, score-0.079]
78 Because the feature vector and the model matrix are very large, we were not able to compute the iterated model, and hence we only present here the results of Dyna-Q(k)-lambda. [sent-210, score-0.085]
79 We recorded the state-action pairs online and replayed the feature of a past state-action pair in planning. [sent-226, score-0.095]
80 We also compared the linear Dyna-style planning for control (with state features) (Sutton et al. [sent-227, score-0.357]
81 In linear Dyna-style planning for control we replayed a state feature of a past time step, and projected it using the model of the action that was selected at that time step. [sent-229, score-0.434]
82 [Figure 2 plot: online return vs. episode; curves for Dyna-Q(1), Dyna-Q(5)-lambda, Dyna-Q(10)-lambda, Dyna-Q(20)-lambda, Dyna-Q(∞), linear Dyna, and Q-learning.] Figure 2: Results on Mountain-car: comparison of online return of Dyna-Q(k)-lambda, Q-learning and linear Dyna for control. [sent-232, score-0.082]
83 Linear Dyna-style planning algorithms were found to be significantly faster than Q-learning. [sent-234, score-0.234]
84 Multi-step planning algorithms can be faster still than single-step planning algorithms. [sent-235, score-0.439]
85 The results also show that planning too many steps into the future is harmful, e. [sent-236, score-0.26]
86 In fact, Dyna-Q(∞) and the LSTD control algorithm were both unstable, and typically failed once or twice in 30 runs. [sent-240, score-0.068]
87 The intuition is that in control the policy changes from time step to time step and the model is highly non-stationary. [sent-241, score-0.221]
88 By solving the model and looking infinite steps into the future, LSTD and Dyna-Q(∞) magnify the errors in the model. [sent-242, score-0.09]
89 7 Conclusion and Future Work We have taken important steps towards extending linear Dyna-style planning to multi-step planning. [sent-243, score-0.288]
90 Multi-step linear Dyna-style planning uses multi-step linear models to project a simulated feature multiple steps into the future. [sent-244, score-0.372]
91 For control, we proposed a different way of doing linear Dyna-style planning that builds a model from state-action pair to state-action pair and tracks the greedy action selection. [sent-245, score-0.14]
92 Experimental results show that multi-step linear Dyna-style planning leads to better performance than existing single-step linear Dyna-style planning on the Boyan Chain and Mountain-car problems. [sent-246, score-0.514]
93 Our experimental results show that linear Dyna-style planning can achieve better performance by using different step-sizes for learning, modeling, and planning than by using a uniform step-size for the three sub-procedures. [sent-247, score-0.456]
94 While it is not clear from previous work, our results fully demonstrate the advantages of linear Dyna over TD/Q-learning for both policy evaluation and control. [sent-248, score-0.202]
95 Our work also sheds light on why previous attempts at developing an independent online LSTD control algorithm were not successful (e. [sent-249, score-0.104]
96 LSTD and Dyna-Q(∞) can become unstable because they magnify the model errors by looking infinite steps into the future. [sent-253, score-0.108]
97 Current experiments do not include comparisons with any other LSTD control algorithm because we did not find in the literature an independent LSTD control algorithm. [sent-254, score-0.136]
98 LSPI is usually off-line, and its extension to online control has to deal with online exploration (Li et al. [sent-255, score-0.169]
99 , 2002; Peters & Schaal, 2008); however, LSTD in those works is still not an independent control algorithm. [sent-258, score-0.068]
100 Dyna-style planning with a linear function approximation and prioritized sweeping. [sent-321, score-0.251]
wordName wordTfidf (topN-words)
[('dyna', 0.884), ('lstd', 0.261), ('planning', 0.205), ('policy', 0.137), ('td', 0.102), ('sutton', 0.1), ('boyan', 0.082), ('ft', 0.078), ('control', 0.068), ('iterated', 0.058), ('linear', 0.046), ('st', 0.043), ('rt', 0.043), ('barto', 0.04), ('steps', 0.037), ('diminishes', 0.037), ('online', 0.036), ('tracking', 0.034), ('reward', 0.033), ('lambda', 0.033), ('rmse', 0.031), ('bradtke', 0.031), ('geramifard', 0.031), ('action', 0.028), ('episodes', 0.028), ('iterating', 0.028), ('iterate', 0.027), ('lspi', 0.027), ('imaginary', 0.027), ('feature', 0.027), ('greedy', 0.026), ('world', 0.026), ('rewards', 0.026), ('szepesv', 0.025), ('accumulate', 0.023), ('state', 0.023), ('ahead', 0.022), ('lookup', 0.022), ('build', 0.022), ('interpolates', 0.021), ('mansley', 0.021), ('nedich', 0.021), ('replayed', 0.021), ('traj', 0.021), ('faster', 0.02), ('evaluation', 0.019), ('projection', 0.019), ('looking', 0.019), ('call', 0.018), ('unstable', 0.018), ('tile', 0.018), ('borkar', 0.018), ('schaal', 0.018), ('magnify', 0.018), ('parr', 0.018), ('intermediate', 0.018), ('future', 0.018), ('lagoudakis', 0.016), ('tsitsiklis', 0.016), ('model', 0.016), ('bt', 0.016), ('transition', 0.016), ('poorer', 0.015), ('bowling', 0.015), ('et', 0.015), ('eligibility', 0.015), ('peters', 0.015), ('bertsekas', 0.015), ('maxa', 0.015), ('recursive', 0.014), ('exploration', 0.014), ('constitute', 0.014), ('initialized', 0.013), ('experience', 0.013), ('shrinks', 0.013), ('tracks', 0.013), ('bellman', 0.013), ('learns', 0.013), ('architecture', 0.013), ('alberta', 0.012), ('inversion', 0.012), ('chain', 0.012), ('shifting', 0.012), ('littman', 0.012), ('powers', 0.012), ('xu', 0.011), ('roy', 0.011), ('simulated', 0.011), ('pair', 0.011), ('china', 0.011), ('advantageous', 0.011), ('temporal', 0.011), ('li', 0.01), ('modeling', 0.01), ('look', 0.009), ('algorithms', 0.009), ('manner', 0.009), ('reinforcement', 0.009), ('stateaction', 0.009), ('bangalore', 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
Author: Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári
Abstract: In this paper we introduce a multi-step linear Dyna-style planning algorithm. The key element of the multi-step linear Dyna is a multi-step linear model that enables multi-step projection of a sampled feature and multi-step planning based on the simulated multi-step transition experience. We propose two multi-step linear models. The first iterates the one-step linear model, but is generally computationally complex. The second interpolates between the one-step model and the infinite-step model (which turns out to be the LSTD solution), and can be learned efficiently online. Policy evaluation on Boyan Chain shows that multi-step linear Dyna learns a policy faster than single-step linear Dyna, and generally learns faster as the number of projection steps increases. Results on Mountain-car show that multi-step linear Dyna leads to much better online performance than single-step linear Dyna and model-free algorithms; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. Our results also suggest that previous attempts on extending LSTD for online control were unsuccessful because LSTD looks infinite steps into the future, and suffers from the model errors in non-stationary (control) environments.
2 0.12546021 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness. 1
3 0.0759537 113 nips-2009-Improving Existing Fault Recovery Policies
Author: Guy Shani, Christopher Meek
Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1
4 0.071142592 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference
Author: Tomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, Joshua B. Tenenbaum
Abstract: Everyday social interactions are heavily influenced by our snap judgments about others’ goals. Even young infants can infer the goals of intentional agents from observing how they interact with objects and other agents in their environment: e.g., that one agent is ‘helping’ or ‘hindering’ another’s attempt to get up a hill or open a box. We propose a model for how people can infer these social goals from actions, based on inverse planning in multiagent Markov decision problems (MDPs). The model infers the goal most likely to be driving an agent’s behavior by assuming the agent acts approximately rationally given environmental constraints and its model of other agents present. We also present behavioral evidence in support of this model over a simpler, perceptual cue-based alternative. 1
5 0.065245137 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black
Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these samplingbased methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1
6 0.061894268 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
7 0.061722882 242 nips-2009-The Infinite Partially Observable Markov Decision Process
8 0.060164444 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
9 0.05605498 45 nips-2009-Beyond Convexity: Online Submodular Minimization
10 0.054436803 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming
11 0.044511512 53 nips-2009-Complexity of Decentralized Control: Special Cases
12 0.043853935 134 nips-2009-Learning to Explore and Exploit in POMDPs
13 0.041465193 220 nips-2009-Slow Learners are Fast
14 0.036940902 54 nips-2009-Compositionality of optimal control laws
15 0.03632696 181 nips-2009-Online Learning of Assignments
16 0.035488982 177 nips-2009-On Learning Rotations
17 0.032941759 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining
18 0.031940907 221 nips-2009-Solving Stochastic Games
19 0.031052241 63 nips-2009-DUOL: A Double Updating Approach for Online Learning
20 0.028700355 85 nips-2009-Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model
topicId topicWeight
[(0, -0.062), (1, 0.027), (2, 0.1), (3, -0.108), (4, -0.092), (5, 0.051), (6, 0.02), (7, 0.006), (8, -0.009), (9, 0.02), (10, 0.059), (11, 0.037), (12, 0.003), (13, 0.052), (14, -0.025), (15, 0.031), (16, -0.014), (17, -0.025), (18, -0.013), (19, 0.006), (20, -0.019), (21, -0.059), (22, 0.059), (23, -0.045), (24, -0.097), (25, -0.003), (26, -0.097), (27, 0.062), (28, -0.002), (29, -0.003), (30, -0.032), (31, -0.04), (32, -0.035), (33, -0.015), (34, 0.013), (35, -0.03), (36, -0.016), (37, 0.006), (38, -0.047), (39, 0.059), (40, 0.062), (41, -0.008), (42, 0.046), (43, -0.018), (44, -0.049), (45, 0.064), (46, -0.029), (47, 0.03), (48, 0.055), (49, 0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.9341836 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
Author: Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári
Abstract: In this paper we introduce a multi-step linear Dyna-style planning algorithm. The key element of the multi-step linear Dyna is a multi-step linear model that enables multi-step projection of a sampled feature and multi-step planning based on the simulated multi-step transition experience. We propose two multi-step linear models. The first iterates the one-step linear model, but is generally computationally complex. The second interpolates between the one-step model and the infinite-step model (which turns out to be the LSTD solution), and can be learned efficiently online. Policy evaluation on Boyan Chain shows that multi-step linear Dyna learns a policy faster than single-step linear Dyna, and generally learns faster as the number of projection steps increases. Results on Mountain-car show that multi-step linear Dyna leads to much better online performance than single-step linear Dyna and model-free algorithms; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. Our results also suggest that previous attempts on extending LSTD for online control were unsuccessful because LSTD looks infinite steps into the future, and suffers from the model errors in non-stationary (control) environments.
2 0.71015644 113 nips-2009-Improving Existing Fault Recovery Policies
Author: Guy Shani, Christopher Meek
Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1
3 0.7089048 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya
Abstract: Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention, seeking stochastic policies that maximize the average (or discounted cumulative) reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because these regard metrics for the task. Though there are two candidate metrics, Kakade’s Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura’s FIM for the stateaction joint distribution, but all RL algorithms with NG have followed Kakade’s approach. In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation for the gNG learning based on a theory of the estimating function, the generalized Natural Actor-Critic (gNAC) algorithm. The gNAC algorithm involves a near optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, the gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. Numerical experiments showed that the proposed gNAC algorithm can estimate gNG efficiently and outperformed the NAC algorithm.
4 0.70751512 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness. 1
5 0.70712203 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
Author: Keith Bush, Joelle Pineau
Abstract: Interesting real-world datasets often exhibit nonlinear, noisy, continuous-valued states that are unexplorable, are poorly described by first principles, and are only partially observable. If partial observability can be overcome, these constraints suggest the use of model-based reinforcement learning. We experiment with manifold embeddings to reconstruct the observable state-space in the context of offline, model-based reinforcement learning. We demonstrate that the embedding of a system can change as a result of learning, and we argue that the best performing embeddings well-represent the dynamics of both the uncontrolled and adaptively controlled system. We apply this approach to learn a neurostimulation policy that suppresses epileptic seizures on animal brain slices. 1
6 0.64790261 134 nips-2009-Learning to Explore and Exploit in POMDPs
7 0.56413293 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
8 0.51136941 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming
9 0.50402182 54 nips-2009-Compositionality of optimal control laws
10 0.4370904 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining
11 0.42199573 242 nips-2009-The Infinite Partially Observable Markov Decision Process
12 0.39148206 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains
13 0.37676066 16 nips-2009-A Smoothed Approximate Linear Program
14 0.34979516 48 nips-2009-Bootstrapping from Game Tree Search
15 0.33712146 220 nips-2009-Slow Learners are Fast
16 0.33621335 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities
17 0.32803124 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization
18 0.3178241 181 nips-2009-Online Learning of Assignments
19 0.30444196 63 nips-2009-DUOL: A Double Updating Approach for Online Learning
20 0.30335623 14 nips-2009-A Parameter-free Hedging Algorithm
topicId topicWeight
[(24, 0.038), (25, 0.04), (35, 0.077), (36, 0.052), (39, 0.047), (53, 0.268), (58, 0.066), (61, 0.149), (71, 0.035), (86, 0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.74274737 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
Author: Hengshuai Yao, Shalabh Bhatnagar, Dongcui Diao, Richard S. Sutton, Csaba Szepesvári
Abstract: In this paper we introduce a multi-step linear Dyna-style planning algorithm. The key element of the multi-step linear Dyna is a multi-step linear model that enables multi-step projection of a sampled feature and multi-step planning based on the simulated multi-step transition experience. We propose two multi-step linear models. The first iterates the one-step linear model, but is generally computationally complex. The second interpolates between the one-step model and the infinite-step model (which turns out to be the LSTD solution), and can be learned efficiently online. Policy evaluation on Boyan Chain shows that multi-step linear Dyna learns a policy faster than single-step linear Dyna, and generally learns faster as the number of projection steps increases. Results on Mountain-car show that multi-step linear Dyna leads to much better online performance than single-step linear Dyna and model-free algorithms; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. Our results also suggest that previous attempts on extending LSTD for online control were unsuccessful because LSTD looks infinite steps into the future, and suffers from the model errors in non-stationary (control) environments.
2 0.67976952 178 nips-2009-On Stochastic and Worst-case Models for Investing
Author: Elad Hazan, Satyen Kale
Abstract: In practice, most investing is done assuming a probabilistic model of stock price returns known as the Geometric Brownian Motion (GBM). While often an acceptable approximation, the GBM model is not always valid empirically. This motivates a worst-case approach to investing, called universal portfolio management, where the objective is to maximize wealth relative to the wealth earned by the best fixed portfolio in hindsight. In this paper we tie the two approaches, and design an investment strategy which is universal in the worst-case, and yet capable of exploiting the mostly valid GBM model. Our method is based on new and improved regret bounds for online convex optimization with exp-concave loss functions. 1
3 0.6178754 78 nips-2009-Efficient Moments-based Permutation Tests
Author: Chunxiao Zhou, Huixia J. Wang, Yongmei M. Wang
Abstract: In this paper, we develop an efficient moments-based permutation test approach to improve the test s computational efficiency by approximating the permutation distribution of the test statistic with Pearson distribution series. This approach involves the calculation of the first four moments of the permutation distribution. We propose a novel recursive method to derive these moments theoretically and analytically without any permutation. Experimental results using different test statistics are demonstrated using simulated data and real data. The proposed strategy takes advantage of nonparametric permutation tests and parametric Pearson distribution approximation to achieve both accuracy and efficiency. 1 In t ro d u c t i o n Permutation tests are flexible nonparametric alternatives to parametric tests in small samples, or when the distribution of a test statistic is unknown or mathematically intractable. In permutation tests, except exchangeability, no other statistical assumptions are required. The p-values can be obtained by using the permutation distribution. Permutation tests are appealing in many biomedical studies, which often have limited observations with unknown distribution. They have been used successfully in structural MR image analysis [1, 2, 3], in functional MR image analysis [4], and in 3D face analysis [5]. There are three common approaches to construct the permutation distribution [6, 7, 8]: (1) exact permutation enumerating all possible arrangements; (2) approximate permutation based on random sampling from all possible permutations; (3) approximate permutation using the analytical moments of the exact permutation distribution under the null hypothesis. The main disadvantage of the exact permutation is the computational cost, due to the factorial increase in the number of permutations with the increasing number of subjects. The second technique often gives inflated type I errors caused by random sampling. When a large number of repeated tests are needed, the random permutation strategy is also computationally expensive to achieve satisfactory accuracy. Regarding the third approach, the exact permutation distribution may not have moments or moments with tractability. In most applications, it is not the existence but the derivation of moments that limits the third approach. To the best of our knowledge, there is no systematic and efficient way to derive the moments of the permutation distribution. Recently, Zhou [3] proposed a solution by converting the permutation of data to that of the statistic coefficients that are symmetric to the permutation. Since the test statistic coefficients usually have simple presentations, it is easier to track the permutation of the test statistic coefficients than that of data. However, this method requires the derivation of the permutation for each specific test statistic, which is not accessible to practical users. In this paper, we propose a novel strategy by employing a general theoretical method to derive the moments of the permutation distribution of any weighted v-statistics, for both univariate and multivariate data. We note that any moments of the permutation distribution for weighted v-statistics [9] can be considered as a summation of the product of data function term and index function term over a high dimensional index set and all possible permutations. 
Our key idea is to divide the whole index set into several permutation equivalent (see Definition 2) index subsets such that the summation of the data/index function term over all permutations is invariant within each subset and can be calculated without conducting any permutation. Then we can obtain the moments by summing up several subtotals. The proposed method can be extended to equivalent weighted v-statistics by replacing them with monotonic weighted v-statistics. This is due to the fact that only the order of test statistics of all permutations matters for obtaining the p-values, so that the monotonic weighted v-statistics shares the same p-value with the original test statistic. Given the first four moments, the permutation distribution can be well fitted by Pearson distribution series. The p-values are then obtained without conducting any real permutation. For multiple comparison of two-group difference, given the sample size n1 = 21 and n2 = 21, the number of tests m = 2,000, we need to conduct m×(n1 + n2 )!/n1 !/n2 ! 1.1×1015 permutations for the exact permutation test. Even for 20,000 random permutations per test, we still need m×20,000 4×107 permutations. Alternatively, our moments-based permutation method using Pearson distribution approximation only involves the calculation of the first four analytically-derived moments of exact permutation distributions to achieve high accuracy (see section 3). Instead of calculating test statistics in factorial scale with exact permutation, our moments-based permutation only requires computation of polynomial order. For example, the computational cost for univariate mean difference test statistic and modified multivariate Hotelling's T2 test statistics [8] are O(n) and O(n3 ), respectively, where n = n 1 + n2 . 2 M e t h o d o lo g y In this section, we shall mainly discuss how to calculate the moments of the permutation distribution for weighted v-statistics. For other test statistics, a possible solution is to replace them with their equivalent weighted v-statistics by monotonic transforms. The detailed discussion about equivalent test statistics can be found in [7, 8, 10]. 2.1 C o m p ut a t i o n a l c h a l l e n g e Let us first look at a toy example. Suppose we have a two-group univariate data x = ( x1 , L , xn1 , xn1 +1 ,L , xn1 + n2 ) , where the first n1 elements are in group A and the rest, n2 ,are in group B. For comparison of the two groups, the hypothesis is typically constructed as: H 0 : m A = mB vs. H a : m A ¹ m B , where m A , m B are the population means of the groups A n1 n i =1 i = n1 +1 and B, respectively. Define x A = å xi / n1 and xB = å xi / n2 as the sample means of two groups, where n=n1+n2. We choose the univariate group mean difference statistic, i.e., n T ( x) = x A - xB = å w(i ) xi , i =1 where the index function as the test w(i ) = 1/ n1 , if i Î {1, L , n1} and w(i ) = -1/ n2 , if i Î {n1 + 1, L, n} . Then the total number of all possible permutations of {1, L, n} is n!. To calculate the fourth moment of the permutation distribution, n n n n n 1 1 4 å ( å w(i ) xp (i ) ) = å å å å å w(i1 ) w(i2 ) w(i3 ) w(i4 )xp ( i1 ) xp ( i2 ) xp ( i3 ) xp ( i4 ) , n ! p ÎS n i =1 n ! p ÎS n i1 =1 i2 =1 i3 =1 i4 =1 where is the permutation operator and the symmetric group Sn [11] includes all distinct permutations. The above example shows that the moment calculation can be considered as a summation over all possible permutations and a large index set. 
It is noticeable that the computational challenge here is to go through the factorial level permutations and polynomial level indices. Ep (T 4 ( x ))= 2.2 P a r t i t i o n t h e i n de x s e t In this paper, we assume that the test statistic T can be expressed as a weighted v-statistic of n n i1 =1 id =1 degree d [9], that is, T ( x) = å L å w(i1 , L , id ) h( xi1 ,L , xid ) , where x = ( x1 , x 2 , L , x n ) T is a data with n observations, and w is a symmetric index function. h is a symmetric data function, i.e., invariant under permutation of (i1 ,L , id ) . Though the symmetry property is not required for our method, it helps reduce the computational cost. Here, each observation xk can be either univariate or multivariate. In the above toy example, d=1 and h is the identity function. Therefore, the r-th moment of the test statistic from the permutated data is: Ep (T r ( x)) = Ep ( å w(i1 ,L , id )h( xp (i1 ) ,L , xp ( id ) ))r i1 ,i2 ,L,id = Ep [ å i1(1) ,L, id (1) , r r k =1 k =1 { Õ w(i1(k ) ,L, id ( k ) ) Õ h( xp (i ( k ) ) ,L, xp (i ( k ) ) )}] . L i1( r ) ,L,id ( r ) d 1 Then we can exchange the summation order of permutations and that of indices, Ep (T r ( x)) = å i1(1) ,L, id (1) , L i1( r ) ,L,id ( r ) r r k =1 k =1 {( Õ w(i1( k ) ,L , id ( k ) )) Ep ( Õ h( xp (i ( k ) ) ,L, xp (i ( k ) ) ))}. d 1 Thus any moment of permutation distribution can be considered as a summation of the product of data function term and index function term over a high dimensional index set and all possible permutations. Since all possible permutations map any index value between 1 and n to all possible index r values from 1 to n with equal probability, Ep ( Õ h( xp (i ( k ) ) ,L , xp (i ( k ) ) )) , the summation of k =1 1 d data function over all permutations is only related to the equal/unequal relationship among indices. It is natural to divide the whole index set U = {i1 ,L , id }r = {(i1(1) , L , id (1) ), L (r ) , ( i1 , L , id (r ) r )} into the union of disjoint index subsets, in which Ep ( Õ h( xp (i ( k ) ) ,L , xp (i ( k ) ) )) k =1 1 d is invariant. Definition 1. Since h is a symmetric function, two index elements (i1 ,L , id ) and ( j1 ,L , jd ) are said to be equivalent if they are the same up to the order. For example, for d = 3, (1, 4, 5) = (1,5,4) = (4,1,5) = (4,5,1) = (5,1,4) = (5,4,1). Definition 2. Two indices {(i1(1) , L , id (1) ), L , (i1( r ) , L , id ( r ) )} and {( j1(1) , L , jd (1) ), L , ( j1( r ) , L , jd ( r ) )} are said to be permutation equivalent/ if there exists a permutation p Î Sn such that {(p (i1(1) ), L , p (id (1) )), L , (p (i1( r ) ), L , p (id ( r ) ))} = {( j1(1) , L , jd (1) ), L , ( j1( r ) , L , jd ( r ) )} . Here
4 0.58129925 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning
Author: Anne Hsu, Thomas L. Griffiths
Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented. 1
5 0.57998377 242 nips-2009-The Infinite Partially Observable Markov Decision Process
Author: Finale Doshi-velez
Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1
6 0.56389165 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
7 0.54825377 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
8 0.53724384 33 nips-2009-Analysis of SVM with Indefinite Kernels
9 0.47531837 134 nips-2009-Learning to Explore and Exploit in POMDPs
10 0.47432911 113 nips-2009-Improving Existing Fault Recovery Policies
11 0.47173652 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
12 0.47081771 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining
13 0.46317035 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference
14 0.45946908 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
15 0.45575115 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization
16 0.45348078 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
17 0.44225463 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
18 0.43848374 3 nips-2009-AUC optimization and the two-sample problem
19 0.43490419 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
20 0.43422815 48 nips-2009-Bootstrapping from Game Tree Search