nips nips2010 nips2010-212 knowledge-graph by maker-knowledge-mining

212 nips-2010-Predictive State Temporal Difference Learning


Source: pdf

Author: Byron Boots, Geoffrey J. Gordon

Abstract: We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. [sent-6, score-0.264]

2 In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. [sent-7, score-0.131]

3 Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. [sent-8, score-0.316]

4 By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. [sent-9, score-0.132]

5 As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. [sent-12, score-0.745]

6 1 Introduction We wish to estimate the value function of a policy in an unknown decision process in a high-dimensional and partially observable environment. [sent-16, score-0.214]

7 We represent the value function in a linear architecture, as a linear combination of features of (sequences of) observations. [sent-17, score-0.181]

8 In the current paper, we build on this insight, while simultaneously finding a compact set of features using powerful methods from system identification. [sent-28, score-0.175]

9 First, we look at the problem of improving LSTD from a model-free predictive-bottleneck perspective: given a large set of features of history, we devise a new TD method called Predictive State Temporal Difference (PSTD) learning. [sent-29, score-0.146]

10 PSTD estimates the value function through a bottleneck that preserves only predictive information (Section 3). [sent-30, score-0.277]

11 Instead of learning a linear transition model in feature space, as in [5], we use subspace identification [6, 7] to learn a PSR from our samples. [sent-32, score-0.125]

12 This result yields some appealing theoretical benefits: for example, PSTD features can be explicitly interpreted as a statistically consistent estimate of the true underlying system state. [sent-35, score-0.229]

13 And, the feasibility of finding the true value function can be shown to depend on the linear dimension of the dynamical system, or equivalently, the dimensionality of the predictive state representation—not on the cardinality of the POMDP state space. [sent-36, score-0.419]

14 We show that, if we add a large number of weakly relevant features to these hand-tuned features, PSTD can find a predictive subspace which performs much better than competing approaches, improving on the best previously reported result for this problem by a substantial margin. [sent-40, score-0.417]

15 2 Value Function Approximation We start from a discrete time dynamical system with a set of states S, a set of actions A, a distribution over initial states π0 , a transition function T , a reward function R, and a discount factor γ ∈ [0, 1]. [sent-42, score-0.289]

16 We seek a policy π, a mapping from states to actions. [sent-43, score-0.183]

17 For a given policy π, the value of state s is defined as the expected discounted sum of rewards when starting in state s and following policy π, $J^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, s_0 = s, \pi\right]$. [sent-44, score-0.572]

18 However, we consider instead the harder problem of estimating the value function when s is a partially observable latent variable, and when the transition function T is unknown. [sent-46, score-0.123]

19 We can no longer make decisions or predict reward based on S, but instead must use a history (an ordered sequence of action-observation pairs $h = a^h_1 o^h_1 \ldots$. [sent-48, score-0.23]

20 H is often very large or infinite, so instead of finding a value separately for each history, we focus on value functions that are linear in features of histories: $J^\pi(h) = w^T \phi^H(h)$ (2). Here $w \in \mathbb{R}^j$ is a parameter vector and $\phi^H(h) \in \mathbb{R}^j$ is a feature vector for a history h. [sent-53, score-0.449]

21 So, we can rewrite the Bellman equation as $w^T \phi^H(h) = R(h) + \gamma \sum_{o \in O} w^T \phi^H(h\pi o)\, \Pr[h\pi o \mid h\pi]$ (3), where $h\pi o$ is history h extended by taking action π(h) and observing o. [sent-54, score-0.174]

22 1 Least Squares Temporal Difference Learning In general we don’t know the transition probabilities $\Pr[h\pi o \mid h]$, but we do have samples of state features $\phi^H_t = \phi^H(h_t)$, next-state features $\phi^H_{t+1} = \phi^H(h_{t+1})$, and immediate rewards $R_t = R(h_t)$. [sent-56, score-0.477]

23 As the amount of data k increases, the empirical covariance matrices $\phi^H_{1:k} (\phi^H_{1:k})^T / k$ and $\phi^H_{2:k+1} (\phi^H_{1:k})^T / k$ converge with probability 1 to their population values, and so our estimate of the matrix to be inverted in (5) is consistent. [sent-67, score-0.158]
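To make the LSTD estimate above concrete, here is a minimal NumPy sketch (not the authors' code): it assumes the sampled history features, successor-history features, and rewards have been stacked into arrays, and solves the standard LSTD(0) normal equations; the ridge term is an added safeguard, not part of equation (5).

```python
import numpy as np

def lstd(Phi, Phi_next, R, gamma, ridge=1e-8):
    """LSTD(0) estimate of the linear value-function weights w.

    Phi      : (k, j) array, rows are phi^H(h_t) for t = 1..k
    Phi_next : (k, j) array, rows are phi^H(h_{t+1})
    R        : (k,)   array of immediate rewards R(h_t)
    gamma    : discount factor
    The small ridge term guards against a singular matrix when k is small.
    """
    k = len(R)
    A = Phi.T @ (Phi - gamma * Phi_next) / k   # empirical matrix to be inverted, cf. (5)
    b = Phi.T @ R / k                          # feature-reward correlation
    return np.linalg.solve(A + ridge * np.eye(A.shape[0]), b)
```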

24 3 Predictive Features LSTD provides a consistent estimate of the value function parameters w; but in practice, if the number of features is large relative to the number of training samples, then the LSTD estimate of w is prone to overfitting. [sent-69, score-0.233]

25 This problem can be alleviated by choosing a small set of features that only contains information that is relevant for value function approximation. [sent-70, score-0.212]

26 However, with the exception of LARS-TD [9], there has been little work on how to select features automatically for value function approximation when the system model is unknown; and of course, manual feature selection depends on not-always-available expert guidance. [sent-71, score-0.238]

27 We approach the problem of finding a good set of features from a bottleneck perspective. [sent-72, score-0.18]

28 That is, given a large set of features of history, we would like to find a compression that preserves only relevant information for predicting the value function J π . [sent-73, score-0.358]

29 1 Finding Predictive Features Through a Bottleneck In order to find a predictive feature compression, we first need to determine what we would like to predict. [sent-76, score-0.2]

30 The most relevant prediction is the value function itself; so, we could simply try to predict total future discounted reward. [sent-77, score-0.161]

31 We call these prediction tasks, collectively, features of the future. [sent-82, score-0.175]

32 We write $\phi^T_t$ for the vector of all features of the “future at time t,” i. [sent-83, score-0.174]

33 Now, instead of remembering a large arbitrary set of features of history, we want to find a small subspace of features of history that is relevant for predicting features of the future. [sent-86, score-0.638]

34 We will call this subspace a predictive compression, and we will write the value function as a linear function of only the predictive compression of features. [sent-87, score-0.585]

35 To find our predictive compression, we will use reduced-rank regression [10]. [sent-88, score-0.172]

36 We define the following empirical covariance matrices between features of the future and features of histories: $\Sigma_{T,H} = \frac{1}{k}\sum_{t=1}^{k} \phi^T_t (\phi^H_t)^T$ and $\Sigma_{H,H} = \frac{1}{k}\sum_{t=1}^{k} \phi^H_t (\phi^H_t)^T$ (6). Let $L_H$ be the lower triangular Cholesky factor of $\Sigma_{H,H}$. [sent-89, score-0.414]

37 2, scaling up future reward by a constant factor results in a value-directed compression—but, unlike previous ways to find value-directed compressions [8], we do not need to know a model of our system ahead of time. [sent-96, score-0.15]

38 For another example, let $L_T$ be the Cholesky factor of the empirical covariance of future features $\Sigma_{T,T}$. [sent-97, score-0.238]

39 Then, if we scale features of the future by $L_T^{-T}$, the SVD will preserve the largest possible amount of mutual information between history and future, yielding a canonical correlation analysis [13, 14]. [sent-98, score-0.311]
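As an illustration of the compression step described above, the following sketch (our own variable names, not the paper's code) performs a rank-d reduced-rank regression of future features on history features: it forms the empirical covariances of equation (6), whitens the history side with the Cholesky factor L_H, takes a thin SVD, and returns an operator that maps history features to d predictive features. The value-directed and CCA variants discussed above correspond to additionally rescaling the future features before the SVD; the paper's exact operator may differ in such details.

```python
import numpy as np

def predictive_compression(PhiT, PhiH, d, eps=1e-9):
    """Rank-d predictive compression of history features via reduced-rank regression.

    PhiT : (k, m) array, rows are phi^T_t (features of the future)
    PhiH : (k, j) array, rows are phi^H_t (features of history)
    Returns F of shape (d, j): compressed features are F @ phi^H(h).
    """
    k, j = PhiH.shape
    Sigma_TH = PhiT.T @ PhiH / k                       # cross-covariance, eq. (6)
    Sigma_HH = PhiH.T @ PhiH / k + eps * np.eye(j)     # history covariance, eq. (6)
    L_H = np.linalg.cholesky(Sigma_HH)                 # lower-triangular factor
    M = np.linalg.solve(L_H, Sigma_TH.T).T             # Sigma_TH @ inv(L_H)^T (whitened)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # principal predictive directions
    Ud = U[:, :d]
    # Project the full regression Sigma_TH @ inv(Sigma_HH) onto the top-d directions.
    F = Ud.T @ np.linalg.solve(Sigma_HH.T, Sigma_TH.T).T
    return F
```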

40 4 Predictive State Representations A predictive state representation (PSR) [15] is a compact and complete description of a dynamical system. [sent-104, score-0.299]

41 Unlike POMDPs, which represent state as a distribution over a latent variable, PSRs represent state as a set of predictions of tests. [sent-105, score-0.208]

42 Just as a history is an ordered sequence of action-observation pairs executed prior to time t, we define a test of length i to be an ordered sequence of action-observation pairs $\tau = a_1 o_1 \ldots$. [sent-106, score-0.126]

43 The prediction for a test τ after a history h, written τ(h), is the probability that we will see the test observations $\tau^O = o_1 \ldots$. [sent-110, score-0.13]

44 Finally, a core set Q is minimal if the tests in Q are linearly independent [17, 18], i. [sent-132, score-0.158]

45 In addition to the above PSR parameters, for reinforcement learning we need a reward function $R(h) = \eta^T Q(h)$ mapping predictive states to immediate rewards, a discount factor γ ∈ [0, 1] which weights the importance of future rewards vs. [sent-139, score-0.464]

46 present ones, and a policy π(Q(h)) mapping from predictive states to actions. [sent-140, score-0.355]

47 TPSRs have exactly the same predictive abilities as regular PSRs, but are invariant under similarity transforms: given an invertible matrix S, we can transform $m_1 \to S m_1$, $m_\infty^T \to m_\infty^T S^{-1}$, and $M_{ao} \to S M_{ao} S^{-1}$ without changing the corresponding dynamical system, since pairs $S^{-1} S$ cancel in Eq. [sent-144, score-0.265]
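The invariance claim above is easy to check numerically; the snippet below (with arbitrary made-up TPSR parameters, purely illustrative) verifies that a prediction computed from (m_1, m_inf, M_ao) is unchanged after the similarity transform by S.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
m1 = rng.standard_normal(d)                         # initial TPSR state
m_inf = rng.standard_normal(d)                      # normalization vector
M_ao = 0.3 * rng.standard_normal((d, d))            # one observable operator
S = np.eye(d) + 0.1 * rng.standard_normal((d, d))   # an invertible transform
S_inv = np.linalg.inv(S)

pred = m_inf @ M_ao @ M_ao @ m1                     # e.g. a two-step prediction

# Transformed parameters: m1 -> S m1, m_inf^T -> m_inf^T S^{-1}, M_ao -> S M_ao S^{-1}
pred_transformed = (m_inf @ S_inv) @ (S @ M_ao @ S_inv) @ (S @ M_ao @ S_inv) @ (S @ m1)
assert np.allclose(pred, pred_transformed)          # the S^{-1} S pairs cancel
```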

48 Then, let T be a larger core set of tests (not necessarily minimal), and let H be the set of all possible histories. [sent-150, score-0.158]

49 As before, write $\phi^H_t$ for a vector of features of history at time t, and write $\phi^T_t$ for a vector of features of the future at time t. [sent-151, score-0.489]

50 And, since feature predictions are linear combinations of test predictions, we can also compute any feature prediction φ(h) as a linear function of T (h). [sent-153, score-0.123]

51 We define the matrix $\Phi^T$, with one column per test in T, to embody our predictions of future features: an entry of $\Phi^T$ is the weight of one of the tests in T for calculating the prediction of one of the features in $\phi^T$. [sent-154, score-0.43]

52 First we define $\Sigma_{H,H}$, the covariance matrix of features of histories, as $\mathbb{E}[\phi^H_t (\phi^H_t)^T \mid h_t \sim \omega]$. [sent-158, score-0.301]

53 Next we define $\Sigma_{S,H}$, the cross-covariance of states and features of histories. [sent-161, score-0.228]

54 If $s_t$ denotes the state at time t, let $\Sigma_{S,H} = \mathbb{E}\big[\frac{1}{k}\, s_{1:k} (\phi^H_{1:k})^T\big]$. We cannot directly estimate $\Sigma_{S,H}$ from data, but this matrix will appear as a factor in several of the matrices that we define below. [sent-163, score-0.167]

55 By do(ζ), we mean to approximate the effect of executing all sequences of actions required by all tests or features of the future at once. [sent-165, score-0.331]

56 This is not difficult in our experiments (in which all tests use compatible action sequences); but see [12] for further discussion. [sent-166, score-0.165]

57 Next we define $\Sigma_{H,ao,H}$, a set of matrices, one for each action-observation pair, that represent the covariance between features of history before and after taking action a and observing o. [sent-171, score-0.343]

58 Finally we define $\Sigma_{R,H}$, and approximate the covariance (in this case a vector) of reward and features of history: $\widehat{\Sigma}_{R,H} \equiv \frac{1}{k}\sum_{t=1}^{k} R_t (\phi^H_t)^T$, $\Sigma_{R,H} \equiv \mathbb{E}[R_t (\phi^H_t)^T \mid h_t \sim \omega] = \eta^T \Sigma_{S,H}$ (12d). Again, as $k \to \infty$, $\widehat{\Sigma}_{R,H}$ converges to $\Sigma_{R,H}$ with probability 1. [sent-174, score-0.356]
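For concreteness, a plain sample-average version of the empirical covariances defined above might look like the sketch below (argument names are ours; the paper's estimators may differ in details such as the handling of off-policy actions).

```python
import numpy as np
from collections import defaultdict

def empirical_covariances(PhiH, actions, observations, rewards):
    """Sample-average estimates of Sigma_{H,H}, Sigma_{H,ao,H}, and Sigma_{R,H}.

    PhiH         : (k+1, j) array, rows are phi^H(h_t) for t = 1..k+1
    actions      : length-k list of actions a_t taken at each step
    observations : length-k list of observations o_t received
    rewards      : (k,) array of immediate rewards R_t
    """
    k, j = len(rewards), PhiH.shape[1]
    Sigma_HH = PhiH[:k].T @ PhiH[:k] / k                  # covariance of history features
    Sigma_RH = rewards @ PhiH[:k] / k                     # reward/feature covariance, cf. (12d)
    Sigma_HaoH = defaultdict(lambda: np.zeros((j, j)))    # one matrix per (a, o) pair
    for t in range(k):
        Sigma_HaoH[(actions[t], observations[t])] += np.outer(PhiH[t + 1], PhiH[t]) / k
    return Sigma_HH, dict(Sigma_HaoH), Sigma_RH
```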

59 To do so we need to make a somewhat-restrictive assumption: we assume that our features of history are rich enough to determine the state of the system, i. [sent-176, score-0.332]

60 For a fixed policy π, a PSR or TPSR’s value function is a linear function of state, $V(s) = w^T s$, and is the solution of the PSR Bellman equation [22]: for all s, $w^T s = b_\eta^T s + \gamma \sum_{o \in O} w^T B_{\pi o}\, s$, or equivalently, $w^T = b_\eta^T + \gamma \sum_{o \in O} w^T B_{\pi o}$. [sent-196, score-0.217]
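Because this Bellman equation is linear in w, it can be solved directly once the PSR parameters are known or estimated; a minimal sketch, with our own argument names:

```python
import numpy as np

def psr_value_weights(b_eta, B_pio, gamma):
    """Solve w^T = b_eta^T + gamma * sum_o w^T B_{pi o} for w.

    b_eta : (d,)  reward weights in PSR/TPSR state coordinates
    B_pio : list of (d, d) operators B_{pi o}, one per observation o
    """
    d = len(b_eta)
    B_sum = sum(B_pio)                       # sum over observations of B_{pi o}
    # w^T (I - gamma * B_sum) = b_eta^T  =>  (I - gamma * B_sum)^T w = b_eta
    return np.linalg.solve((np.eye(d) - gamma * B_sum).T, b_eta)
```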

61 The problem is to find the value function of a policy in a partially observable Markov decision process (POMDP). [sent-212, score-0.221]

62 (B) Estimating the value function with a small set of informative features and a large set of random features. [sent-230, score-0.181]

63 PSTD is able to determine a small set of compressed features that retain the maximal amount of information about the value function, outperforming LSTD and LARS-TD. [sent-233, score-0.241]

64 We perform 3 experiments, comparing the performance of LSTD, LARS-TD, and PSTD when different sets of features are used. [sent-236, score-0.146]

65 In the first experiment we execute the policy π for 1000 time steps. [sent-238, score-0.153]

66 We split the data into overlapping histories and tests of length 5, and sample 10 of these histories and tests to serve as centers for Gaussian radial basis functions. [sent-239, score-0.45]

67 For the second experiment, we added 490 random, uninformative features to the 10 good features and then attempted to learn the value function with each of the 3 algorithms (Figure 1(B)). [sent-243, score-0.327]

68 LARS-TD, designed for precisely this scenario, was able to select the 10 relevant features and estimate the value function better by a substantial margin. [sent-245, score-0.238]

69 For the third experiment, we increased the number of sampled features from 10 to 500. [sent-246, score-0.146]

70 In this case, each feature was somewhat relevant, but the number of features was large compared to the amount of training data. [sent-247, score-0.198]

71 This situation occurs frequently in practice: it is often easy to find a large number of features that are at least somewhat related to state. [sent-248, score-0.146]

72 PSTD outperforms LSTD and LARS-TD by summarizing these features and efficiently estimating the value function (Figure 1(C)). [sent-249, score-0.207]

73 In some derivatives the contract holder has no choices, but in more complex cases, the holder must make decisions, and the value of the contract depends on how the holder acts—e. [sent-252, score-0.421]

74 , with early exercise the holder can decide to terminate the contract at any time and receive payments based on prevailing market conditions, so deciding when to exercise is an optimal stopping problem. [sent-254, score-0.305]

75 Stopping problems provide an ideal testbed for policy evaluation methods, since we can collect a single data set which lets us evaluate any policy: we just choose the “continue” action forever. [sent-255, score-0.197]

76 At exercise the holder receives a payoff equal to the current price of the stock divided by the price 100 days beforehand. [sent-260, score-0.337]

77 We can think of this derivative as a “psychic call”: the holder gets to decide whether s/he would like to have bought an ordinary 100-day European call option, at the then-current market price, 100 days ago. [sent-261, score-0.192]

78 Assuming stock prices fluctuate only on days when the market is open, these parameters correspond to an annual growth rate of ∼ 10%. [sent-265, score-0.202]

79 In more detail, if $w_t$ is a standard Brownian motion, then the stock price $p_t$ evolves as $dp_t = \rho p_t\, dt + \sigma p_t\, dw_t$, and we can summarize the relevant state at the end of each day as a vector $x_t \in \mathbb{R}^{100}$, with $x_t = \big(\frac{p_{t-99}}{p_{t-100}}, \frac{p_{t-98}}{p_{t-100}}, \ldots, \frac{p_t}{p_{t-100}}\big)$. [sent-266, score-0.921]
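As an illustration only (the drift and volatility below are placeholders, not the paper's parameters), the state vector x_t can be generated from a simulated daily price path like this:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma = 0.0004, 0.02          # assumed daily drift and volatility (illustrative)
n_days = 1_100
p = np.empty(n_days)
p[0] = 1.0
for t in range(1, n_days):
    # Euler step of dp_t = rho * p_t dt + sigma * p_t dw_t, with dt = one trading day
    p[t] = p[t - 1] * (1.0 + rho + sigma * rng.standard_normal())

def state(t):
    """x_t = (p_{t-99}/p_{t-100}, ..., p_t/p_{t-100}) in R^100."""
    return p[t - 99:t + 1] / p[t - 100]

x = state(200)
G = x[-1]                          # exercise payoff G(x) = x(100)
```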

80 The immediate reward for exercising the option is G(x) = x(100), and the immediate reward for continuing to hold the option is 0. [sent-271, score-0.257]

81 The discount factor $\gamma = e^{-\rho}$ is determined by the growth rate; this corresponds to assuming that the risk-free interest rate is equal to the stock’s growth rate, meaning that the investor gains nothing in expectation by holding the stock itself. [sent-272, score-0.229]

82 Our goal is to calculate an approximate value function $V(x) = w^T \phi^H(x)$, and then use this value function to generate a stopping time $\min\{t \mid G(x_t) \ge V(x_t)\}$. [sent-274, score-0.127]

83 To do so, we sample a sequence of 1,000,000 states $x_t \in \mathbb{R}^{100}$ and calculate features $\phi^H$ of each state. [sent-275, score-0.225]

84 We then perform policy iteration on this sample, alternately estimating the value function under a given policy and then using this value function to define a new greedy policy “stop if $G(x_t) \ge w^T \phi^H(x_t)$.” [sent-276, score-0.555]

85 Within the above strategy, we have two main choices: which features do we use, and how do we estimate the value function in terms of these features. [sent-277, score-0.207]

86 In each case we re-used our 1,000,000-state sample trajectory for all iterations: we start at the beginning and follow the trajectory as long as the policy chooses the “continue” action, with reward 0 at each step. [sent-279, score-0.234]

87 When the policy executes the “stop” action, the reward is G(x) and the next state’s features are all 0; we then restart the policy 100 steps in the future, after the process has fully mixed. [sent-280, score-0.533]
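A sketch of this trajectory-reuse protocol (our own simplification, with assumed array names): follow the fixed sample trajectory while the greedy policy continues, credit the discounted payoff when it stops, and skip ahead 100 steps before the next episode.

```python
import numpy as np

def evaluate_greedy_stopping(Phi, G, w, gamma, gap=100):
    """Average discounted return of the policy "stop when G(x_t) >= w . phi^H(x_t)".

    Phi : (T, j) array of features phi^H(x_t) along the reused sample trajectory
    G   : (T,)   array of exercise payoffs G(x_t)
    """
    t, total, episodes = 0, 0.0, 0
    T = len(G)
    while t < T:
        start = t
        while t < T and G[t] < Phi[t] @ w:        # "continue": reward 0
            t += 1
        if t < T:
            total += gamma ** (t - start) * G[t]  # "stop": reward G(x_t)
            episodes += 1
        t += gap                                  # restart after the process has mixed
    return total / max(episodes, 1)
```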

88 For feature selection, we are fortunate: previous researchers have hand-selected a “good” set of 16 features for this data set through repeated trial and error (see [12] and [24, 25]). [sent-281, score-0.174]

89 Specifically, we add the entire 100-step state vector, the squares of the components of the state vector, and several additional nonlinear features, increasing the total number of features from 16 to 220. [sent-283, score-0.347]

90 We use histories of length 1, tests of length 5, and (for comparison’s sake) we choose a linear dimension of 16. [sent-284, score-0.225]

91 Tests (but not histories) were value-directed by reducing the variance of all features except reward by a factor of 100. [sent-285, score-0.227]

92 We compared PSTD (reducing 220 to 16 features) to LSTD with either the 16 hand-selected features or the full 220 features, as well as to LARS-TD (220 features) and to a simple thresholding strategy [24]. [sent-287, score-0.146]

93 In each case we evaluated the final policy on 10,000 new random trajectories. [sent-288, score-0.153]

94 6 Conclusion In this paper, we attack the feature selection problem for temporal difference learning. [sent-294, score-0.143]

95 PSTD automatically chooses a small set of features that are relevant for prediction and value function approximation. [sent-298, score-0.241]

96 It approaches feature selection from a bottleneck perspective, by finding a small set of features that preserves only predictive information. [sent-299, score-0.416]

97 Because of the focus on predictive information, the PSTD approach is closely connected to PSRs: under appropriate assumptions, PSTD’s compressed set of features is asymptotically equivalent to TPSR state, and PSTD is a consistent estimator of the PSR value function. [sent-300, score-0.389]

98 Regularization and feature selection in least-squares temporal difference learning. [sent-350, score-0.143]

99 Predictive state representations: A new theory for modeling dynamical systems. [sent-385, score-0.127]

100 Improving approximate value iteration using memories and predictive state representations. [sent-402, score-0.292]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pstd', 0.674), ('lstd', 0.254), ('psr', 0.192), ('predictive', 0.172), ('wt', 0.171), ('psrs', 0.168), ('policy', 0.153), ('tpsr', 0.152), ('features', 0.146), ('mao', 0.127), ('tests', 0.121), ('compression', 0.11), ('histories', 0.104), ('holder', 0.104), ('history', 0.101), ('state', 0.085), ('tpsrs', 0.084), ('temporal', 0.083), ('reward', 0.081), ('bellman', 0.079), ('ht', 0.077), ('stock', 0.072), ('pt', 0.072), ('subspace', 0.068), ('boots', 0.067), ('investor', 0.067), ('td', 0.062), ('byron', 0.059), ('stopping', 0.057), ('geoffrey', 0.052), ('covariance', 0.052), ('xt', 0.049), ('price', 0.049), ('svd', 0.046), ('reinforcement', 0.046), ('action', 0.044), ('dynamical', 0.042), ('prices', 0.041), ('pricing', 0.041), ('future', 0.04), ('predictions', 0.038), ('exercise', 0.038), ('core', 0.037), ('pr', 0.037), ('contract', 0.037), ('identi', 0.037), ('compressed', 0.036), ('immediate', 0.036), ('preserves', 0.036), ('value', 0.035), ('rewards', 0.035), ('bottleneck', 0.034), ('bt', 0.034), ('ssid', 0.034), ('parr', 0.033), ('growth', 0.033), ('observable', 0.033), ('derivative', 0.032), ('pomdp', 0.032), ('difference', 0.032), ('squares', 0.031), ('relevant', 0.031), ('singular', 0.031), ('market', 0.031), ('states', 0.03), ('matrices', 0.03), ('embody', 0.03), ('sajid', 0.03), ('siddiqi', 0.03), ('prediction', 0.029), ('system', 0.029), ('transition', 0.029), ('equation', 0.029), ('feature', 0.028), ('write', 0.028), ('mt', 0.028), ('nancial', 0.028), ('statistically', 0.028), ('rl', 0.028), ('day', 0.028), ('instrumental', 0.027), ('satinder', 0.027), ('rt', 0.027), ('discounted', 0.026), ('estimating', 0.026), ('matrix', 0.026), ('estimate', 0.026), ('representations', 0.026), ('oh', 0.025), ('executed', 0.025), ('days', 0.025), ('transform', 0.025), ('discount', 0.024), ('payoffs', 0.024), ('uncorrelated', 0.024), ('amount', 0.024), ('actions', 0.024), ('stop', 0.023), ('ah', 0.023), ('continuing', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 212 nips-2010-Predictive State Temporal Difference Learning

Author: Byron Boots, Geoffrey J. Gordon

Abstract: We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem. 1

2 0.24619505 134 nips-2010-LSTD with Random Projections

Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos

Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1

3 0.15288159 208 nips-2010-Policy gradients in linearly-solvable MDPs

Author: Emanuel Todorov

Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.

4 0.1447138 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks

Author: Atsushi Miyamae, Yuichi Nagata, Isao Ono, Shigenobu Kobayashi

Abstract: In this paper, we propose an efficient algorithm for estimating the natural policy gradient using parameter-based exploration; this algorithm samples directly in the parameter space. Unlike previous methods based on natural gradients, our algorithm calculates the natural policy gradient using the inverse of the exact Fisher information matrix. The computational cost of this algorithm is equal to that of conventional policy gradients whereas previous natural policy gradient methods have a prohibitive computational cost. Experimental results show that the proposed method outperforms several policy gradient methods. 1

5 0.13414165 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

Author: Tang Jie, Pieter Abbeel

Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1

6 0.13372165 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

7 0.13229749 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

8 0.1311592 152 nips-2010-Learning from Logged Implicit Exploration Data

9 0.12253992 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration

10 0.12075565 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

11 0.11949635 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

12 0.11886188 43 nips-2010-Bootstrapping Apprenticeship Learning

13 0.097634494 93 nips-2010-Feature Construction for Inverse Reinforcement Learning

14 0.09391962 168 nips-2010-Monte-Carlo Planning in Large POMDPs

15 0.091432899 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

16 0.086597659 11 nips-2010-A POMDP Extension with Belief-dependent Rewards

17 0.083236255 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations

18 0.081301741 229 nips-2010-Reward Design via Online Gradient Ascent

19 0.079470411 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

20 0.077207319 19 nips-2010-A rational decision making framework for inhibitory control


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.202), (1, -0.228), (2, -0.039), (3, -0.009), (4, 0.019), (5, -0.021), (6, 0.035), (7, 0.021), (8, -0.056), (9, -0.035), (10, -0.006), (11, 0.013), (12, 0.039), (13, 0.021), (14, 0.024), (15, 0.005), (16, 0.033), (17, -0.041), (18, 0.002), (19, 0.058), (20, -0.065), (21, 0.114), (22, 0.01), (23, -0.006), (24, -0.008), (25, 0.009), (26, 0.001), (27, 0.02), (28, -0.026), (29, 0.002), (30, 0.003), (31, 0.047), (32, -0.055), (33, 0.112), (34, 0.021), (35, 0.064), (36, 0.041), (37, 0.048), (38, -0.013), (39, -0.061), (40, 0.026), (41, -0.039), (42, -0.0), (43, -0.045), (44, 0.039), (45, -0.093), (46, -0.008), (47, -0.07), (48, -0.149), (49, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92701054 212 nips-2010-Predictive State Temporal Difference Learning

Author: Byron Boots, Geoffrey J. Gordon

Abstract: We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem. 1

2 0.71151006 134 nips-2010-LSTD with Random Projections

Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos

Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1

3 0.68453556 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks

Author: Atsushi Miyamae, Yuichi Nagata, Isao Ono, Shigenobu Kobayashi

Abstract: In this paper, we propose an efficient algorithm for estimating the natural policy gradient using parameter-based exploration; this algorithm samples directly in the parameter space. Unlike previous methods based on natural gradients, our algorithm calculates the natural policy gradient using the inverse of the exact Fisher information matrix. The computational cost of this algorithm is equal to that of conventional policy gradients whereas previous natural policy gradient methods have a prohibitive computational cost. Experimental results show that the proposed method outperforms several policy gradient methods. 1

4 0.68079567 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration

Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos

Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1

5 0.67782086 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr

Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1

6 0.61266083 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

7 0.6125924 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

8 0.61072183 208 nips-2010-Policy gradients in linearly-solvable MDPs

9 0.59271055 152 nips-2010-Learning from Logged Implicit Exploration Data

10 0.59265751 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions

11 0.56709152 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

12 0.56567627 64 nips-2010-Distributionally Robust Markov Decision Processes

13 0.56145388 43 nips-2010-Bootstrapping Apprenticeship Learning

14 0.53677005 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

15 0.53463936 19 nips-2010-A rational decision making framework for inhibitory control

16 0.53005564 93 nips-2010-Feature Construction for Inverse Reinforcement Learning

17 0.52036077 66 nips-2010-Double Q-learning

18 0.50371128 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

19 0.50167984 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories

20 0.49858773 168 nips-2010-Monte-Carlo Planning in Large POMDPs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.04), (16, 0.228), (17, 0.016), (27, 0.062), (30, 0.055), (45, 0.234), (50, 0.065), (52, 0.064), (53, 0.028), (60, 0.026), (71, 0.011), (77, 0.032), (90, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86887628 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints

Author: Kaushik Mitra, Sameer Sheorey, Rama Chellappa

Abstract: Matrix factorization in the presence of missing data is at the core of many computer vision problems such as structure from motion (SfM), non-rigid SfM and photometric stereo. We formulate the problem of matrix factorization with missing data as a low-rank semidefinite program (LRSDP) with the advantage that: 1) an efficient quasi-Newton implementation of the LRSDP enables us to solve large-scale factorization problems, and 2) additional constraints such as orthonormality, required in orthographic SfM, can be directly incorporated in the new formulation. Our empirical evaluations suggest that, under the conditions of matrix completion theory, the proposed algorithm finds the optimal solution, and also requires fewer observations compared to the current state-of-the-art algorithms. We further demonstrate the effectiveness of the proposed algorithm in solving the affine SfM problem, non-rigid SfM and photometric stereo problems.

2 0.8525303 171 nips-2010-Movement extraction by detecting dynamics switches and repetitions

Author: Silvia Chiappa, Jan R. Peters

Abstract: Many time-series such as human movement data consist of a sequence of basic actions, e.g., forehands and backhands in tennis. Automatically extracting and characterizing such actions is an important problem for a variety of different applications. In this paper, we present a probabilistic segmentation approach in which an observed time-series is modeled as a concatenation of segments corresponding to different basic actions. Each segment is generated through a noisy transformation of one of a few hidden trajectories representing different types of movement, with possible time re-scaling. We analyze three different approximation methods for dealing with model intractability, and demonstrate how the proposed approach can successfully segment table tennis movements recorded using a robot arm as haptic input device. 1

3 0.84888536 219 nips-2010-Random Conic Pursuit for Semidefinite Programming

Author: Ariel Kleiner, Ali Rahimi, Michael I. Jordan

Abstract: We present a novel algorithm, Random Conic Pursuit, that solves semidefinite programs (SDPs) via repeated optimization over randomly selected two-dimensional subcones of the PSD cone. This scheme is simple, easily implemented, applicable to very general SDPs, scalable, and theoretically interesting. Its advantages are realized at the expense of an ability to readily compute highly exact solutions, though useful approximate solutions are easily obtained. This property renders Random Conic Pursuit of particular interest for machine learning applications, in which the relevant SDPs are generally based upon random data and so exact minima are often not a priority. Indeed, we present empirical results to this effect for various SDPs encountered in machine learning; these experiments demonstrate the potential practical usefulness of Random Conic Pursuit. We also provide a preliminary analysis that yields insight into the theoretical properties and convergence of the algorithm. 1

4 0.82155937 1 nips-2010-(RF)^2 -- Random Forest Random Field

Author: Nadia Payet, Sinisa Todorovic

Abstract: We combine random forest (RF) and conditional random field (CRF) into a new computational framework, called random forest random field (RF)2 . Inference of (RF)2 uses the Swendsen-Wang cut algorithm, characterized by MetropolisHastings jumps. A jump from one state to another depends on the ratio of the proposal distributions, and on the ratio of the posterior distributions of the two states. Prior work typically resorts to a parametric estimation of these four distributions, and then computes their ratio. Our key idea is to instead directly estimate these ratios using RF. RF collects in leaf nodes of each decision tree the class histograms of training examples. We use these class histograms for a nonparametric estimation of the distribution ratios. We derive the theoretical error bounds of a two-class (RF)2 . (RF)2 is applied to a challenging task of multiclass object recognition and segmentation over a random field of input image regions. In our empirical evaluation, we use only the visual information provided by image regions (e.g., color, texture, spatial layout), whereas the competing methods additionally use higher-level cues about the horizon location and 3D layout of surfaces in the scene. Nevertheless, (RF)2 outperforms the state of the art on benchmark datasets, in terms of accuracy and computation time.

same-paper 5 0.81486213 212 nips-2010-Predictive State Temporal Difference Learning

Author: Byron Boots, Geoffrey J. Gordon

Abstract: We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem. 1

6 0.74989045 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes

7 0.74958616 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

8 0.7466439 287 nips-2010-Worst-Case Linear Discriminant Analysis

9 0.7461794 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

10 0.74567896 148 nips-2010-Learning Networks of Stochastic Differential Equations

11 0.745408 7 nips-2010-A Family of Penalty Functions for Structured Sparsity

12 0.74529052 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs

13 0.74504811 63 nips-2010-Distributed Dual Averaging In Networks

14 0.74495161 280 nips-2010-Unsupervised Kernel Dimension Reduction

15 0.74458599 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

16 0.74407548 103 nips-2010-Generating more realistic images using gated MRF's

17 0.74368525 217 nips-2010-Probabilistic Multi-Task Feature Selection

18 0.74332869 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions

19 0.74315846 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

20 0.74312073 158 nips-2010-Learning via Gaussian Herding