nips nips2007 nips2007-86 knowledge-graph by maker-knowledge-mining

86 nips-2007-Exponential Family Predictive Representations of State


Source: pdf

Author: David Wingate, Satinder S. Baveja

Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. [sent-3, score-0.324]

2 Predictive representations of state (PSRs) capture state as statistics of the future. [sent-4, score-0.366]

3 We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. [sent-5, score-0.613]

4 This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. [sent-6, score-0.306]

5 1 Introduction One of the basic problems in modeling controlled, partially observable, stochastic dynamical systems is representing and tracking state. [sent-9, score-0.201]

6 In a reinforcement learning context, the state of the system is important because it can be used to make predictions about the future, or to control the system optimally. [sent-10, score-0.186]

7 Often, state is viewed as an unobservable, latent variable, but models with predictive representations of state [4] propose an alternative: PSRs represent state as statistics about the future. [sent-11, score-0.528]

8 The original PSR models used the probability of specific, detailed futures called tests as the statistics of interest. [sent-12, score-0.113]

9 Recent work has introduced the more general notion of using parameters that model the distribution of length n futures as the statistics of interest [8]. [sent-13, score-0.186]

10 A history $h_t$ is the sequence of actions and observations up to and including $o_t$ (where subscripts denote time). [sent-18, score-0.952]

11 We emphasize that this distribution directly models observable quantities in the system. [sent-23, score-0.149]

12 Instead of capturing state with tests, the more general idea is to capture state by directly modeling the distribution p(F n |ht ). [sent-24, score-0.362]

13 Our central assumption is that the parameters describing p(F n |ht ) are sufficient for history, and therefore constitute state (as the agent interacts with the system, p(F n |ht ) changes because ht changes; therefore the parameters and hence state change). [sent-25, score-0.914]

14 As an example of this, the Predictive Linear-Gaussian (PLG) model [8] assumes that p(F n |ht ) is jointly Gaussian; state therefore becomes its mean and covariance. [sent-26, score-0.166]

15 Nothing is lost by defining state in terms of observable quantities: Rudary et al [8] proved that the PLG is formally equivalent to the latent-variable approach in linear dynamical systems. [sent-27, score-0.313]

16 Thus, as part of capturing state in a dynamical system in our method, p(F n |ht ) must be estimated. [sent-29, score-0.279]

17 In this paper, we introduce the Exponential Family PSR (EFPSR) which assumes that p(F n |ht ) is a standard exponential family distribution. [sent-34, score-0.198]

18 By selecting the sufficient statistics of the distribution carefully, we can impose graphical structure on p(F n |ht ), and therefore make explicit connections to graphical models, maximum entropy modeling, and Boltzmann machines. [sent-35, score-0.245]

19 The EFPSR inherits both the advantages and disadvantages of graphical exponential family models: inference and parameter learning in the model is generally hard, but all existing research on exponential family distributions is applicable (in particular, work on approximate inference). [sent-36, score-0.57]

20 Selecting the form of p(F n |ht ) and estimating its parameters to capture state is only half of the problem. [sent-37, score-0.222]

21 We must also model the dynamical component, which describes the way that the parameters vary over time (that is, how the parameters of p(F n |ht ) and p(F n |ht+1 ) are related). [sent-38, score-0.271]

22 We describe a method called “extend-and-condition,” which generalizes many state update mechanisms in PSRs. [sent-39, score-0.133]

23 Importantly, the EFPSR has no hidden variables, but can still capture state, which sets it apart from other graphical models of sequential data. [sent-40, score-0.128]

24 This is a consequence of the fact that the model is fully observed: all statistics of interest are directly related to observable quantities. [sent-43, score-0.12]

25 The next sections discuss the specifics of the central parts of the model: the state representation, and how we maintain that state. [sent-46, score-0.133]

26 1 Standard Exponential Family Distributions We first discuss exponential family distributions, which we use because of their close connections to maximum entropy modeling and graphical models. [sent-48, score-0.339]

27 We refer the reader to Jaynes [2] for detailed justification, but briefly, he states that the maximum entropy distribution “agrees with everything that is known, but carefully avoids assuming anything that is not known,” which “is the fundamental property which justifies its use for inference.” [sent-49, score-0.154]

28 The standard exponential family distribution is the form of the maximum entropy distribution under certain constraints. [sent-50, score-0.307]

29 For a random variable X, a standard exponential family distribution has the form $p(X = x; s) = \exp\{s^\top \phi(x) - Z(s)\}$, where $s$ is the canonical (or natural) vector of parameters and $\phi(x)$ is a vector of features of variable $x$. [sent-51, score-0.349]

30 By carefully selecting the features φ(x), graphical structure may be imposed on the distribution. [sent-54, score-0.175]

31 The EFPSR defines state as the parameters of an exponential family distribution modeling p(F n |ht ). [sent-57, score-0.43]

32 To emphasize that these parameters represent state, we will refer to them as $s_t$: $p(F^n = f^n \mid h_t; s_t) = \exp\{s_t^\top \phi(f^n) - \log Z(s_t)\}$ (1), with both $\{\phi(f^n), s_t\} \in \mathbb{R}^{l \times 1}$. [sent-58, score-0.897]

33 We emphasize that st changes with history, but φ(f n ) does not. [sent-59, score-0.297]
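
A minimal illustrative sketch (not from the paper) of evaluating a density of this form for a small binary future: the log-partition function is computed by brute-force enumeration, and the feature function `phi` is a hypothetical placeholder standing in for the paper's φ.

```python
import itertools
import numpy as np

def log_density(f, s, phi, n_vars):
    """log p(F = f; s) = s . phi(f) - log Z(s), with Z computed by enumeration.

    f      : binary vector of length n_vars (one windowed future f^n)
    s      : parameter vector (the EFPSR state s_t), same length as phi(f)
    phi    : feature function mapping a future to a feature vector
    n_vars : number of binary variables in the future window (kept small)
    """
    # Log-partition function: log-sum-exp over all 2^n_vars configurations.
    log_terms = [s @ phi(np.array(cfg, dtype=float))
                 for cfg in itertools.product([0, 1], repeat=n_vars)]
    log_z = np.logaddexp.reduce(log_terms)
    return s @ phi(np.asarray(f, dtype=float)) - log_z
```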

34 In addition to selecting the form of p(F n |ht ), there is a dynamical component: given the parameters of p(F n |ht ), how can we incorporate a new observation to find the parameters of p(F n |ht , ot+1 )? [sent-61, score-0.282]

35 We assume that we have the parameters of p(F n |ht ), denoted st . [sent-64, score-0.31]

36 We capture this with a new feature vector φ+ (f n+1 ) ∈ Rk×1 , and define the vector s+ ∈ Rk×1 to be the parameters associated with this feature vector. [sent-68, score-0.139]

37 To define the dynamics, we define a function which maps the current state vector to the parameters of the extended distribution. [sent-70, score-0.179]

38 We call this the extension function: $s_t^+ = \mathrm{extend}(s_t; \theta)$, where $\theta$ is a vector of parameters controlling the extension function (and hence, the overall dynamics). [sent-71, score-0.174]

39 The extension function helps govern the kinds of dynamics that the model can capture. [sent-72, score-0.17]

40 For example, in the PLG family of work, a linear extension allows the model to capture linear dynamics [8], while a non-linear extension allows the model to capture non-linear dynamics [11]. [sent-73, score-0.461]

41 By extending and conditioning, we can maintain state for arbitrarily long periods. [sent-76, score-0.133]

42 Furthermore, for many choices of features and extension function, the overall extend-and-condition operation does not involve any inference, meaning that tracking state is computationally efficient. [sent-77, score-0.279]

43 There is only one restriction on the extension function: we must ensure that after extending and conditioning the distribution, the resulting distribution can be expressed as $p(F^n = f^n \mid h_{t+1}; s_{t+1}) = \exp\{s_{t+1}^\top \phi(f^n) - \log Z(s_{t+1})\}$. [sent-78, score-0.187]

44 These different models are obtained with different choices of the features φ and the extension function, and are possible because many popular distributions (such as multinomials and Gaussians) are exponential family distributions [11]. [sent-85, score-0.346]

45 3 The Linear-Linear EFPSR We now choose specific features and extension function to generate an example model designed to be analytically tractable. [sent-86, score-0.15]

46 We select a linear extension function, and we carefully choose features so that conditioning is always a linear operation. [sent-87, score-0.231]

47 If the features impose graphical structure on the distribution, it is also equivalent to saying that the form of the graph does not change over time. [sent-93, score-0.2]

48 We assume that we have an undirected graph G which we will use to create the features in the vector φ(), and that we have another graph G+ which we will use to define the features in the vector φ+ (). [sent-98, score-0.208]

49 To use the graph to define our distribution, we will let entries in $\phi$ be conjunctions of atomic observation variables (like the standard Ising model): for $i \in V$, there will be some feature $k$ in the vector such that $\phi(f_t)_k = f_t^i$. [sent-109, score-0.223]

50 Because all features are either atomic variables or conjunctions of variables, conditioning the distribution can be done with an operation which is linear in the state (this is true even if the random variables are discrete or real-valued). [sent-116, score-0.376]
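
As a hedged illustration of this feature construction (the function names and the binary-variable assumption are ours, not the paper's), the sketch below builds Ising-style features from an undirected graph G: one atomic feature per node and one conjunction feature per edge.

```python
import numpy as np

def make_phi(nodes, edges):
    """Build a feature function phi(f): atomic features for each node of G,
    plus conjunction (product) features for each edge of G."""
    def phi(f):
        atomic = [f[i] for i in nodes]                 # phi_k = f^i
        pairwise = [f[i] * f[j] for (i, j) in edges]   # phi_k = f^i * f^j
        return np.array(atomic + pairwise, dtype=float)
    return phi

# Example: a 3-variable chain graph 0 - 1 - 2
phi = make_phi(nodes=[0, 1, 2], edges=[(0, 1), (1, 2)])
print(phi(np.array([1, 0, 1])))   # -> [1. 0. 1. 0. 0.]
```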

51 The combination of a linear extension and a linear conditioning operator can be rolled together into a single operation. [sent-122, score-0.136]

52 Without loss of generality, we can permute the indices in our state vector such that $s_{t+1} = G(o_{t+1})(A s_t + B)$. [sent-123, score-0.133]
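
To make the extend-and-condition recursion concrete, here is an illustrative sketch of tracking state with this Linear-Linear update; the conditioning map G and the parameter shapes are assumptions for the example, not the paper's code.

```python
import numpy as np

def track_state(s0, observations, A, B, G):
    """Run the extend-and-condition recursion s_{t+1} = G(o_{t+1}) (A s_t + B).

    s0           : initial state vector (parameters of p(F^n | h_0))
    observations : iterable of observations o_1, o_2, ...
    A, B         : extension-function parameters (matrix and vector)
    G            : function mapping an observation to a conditioning matrix
    """
    s = s0
    states = [s0]
    for o in observations:
        s_plus = A @ s + B     # extend: parameters of the length-(n+1) distribution
        s = G(o) @ s_plus      # condition on the new observation
        states.append(s)
    return states
```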

53 There are two things which can be learned in our model: the structure of the graph, and the parameters governing the state update. [sent-127, score-0.216]

54 We assume we are given a sequence of T observations, $[o_1 \cdots o_T]$, which we stack to create a sequence of samples from the $F^n \mid h_t$'s: $f_t \mid h_t = [o_{t+1} \cdots o_{t+n} \mid h_t]$. [sent-129, score-0.17]
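
A small helper, sketched under the assumption that each observation is stored as a fixed-length vector, for stacking a trajectory into the length-n future windows described here.

```python
import numpy as np

def stack_futures(observations, n):
    """Return the windowed futures f_t = [o_{t+1}, ..., o_{t+n}],
    treating observations[0] as o_1 (so window t starts at index t)."""
    T = len(observations)
    return [np.concatenate(observations[t:t + n]) for t in range(T - n + 1)]
```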

55 1 Structure Learning To learn the graph structure, we make the approximation of ignoring the dynamical component of the model. [sent-131, score-0.171]

56 That is, we treat each ft as an observation, and try to estimate the density of the resulting unordered set, ignoring the t subscripts (we appeal to density estimation because many good algorithms have been developed for structure induction). [sent-132, score-0.282]

57 For example, if observation a is always followed by observation b, this fact will be captured within the ft ’s. [sent-134, score-0.258]

58 2 Maximum Likelihood Parameter Estimation With the structure of the graph in place, we are left to learn the parameters A and B of the state extension. [sent-141, score-0.23]

59 Second, when trying to learn a sequence of states (st ’s) given a long trajectory of futures (ft ’s), each ft is a sample of information directly from the distribution we’re trying to model. [sent-143, score-0.25]

60 Given a parameter estimate, an initial state s0 , and a sequence of observations, the sequence of st ’s is completely determined. [sent-144, score-0.397]

61 Although the sequence of state vectors st are the parameters defining the distributions p(F n |ht ), they are not the model parameters – that is, we cannot freely select them. [sent-146, score-0.522]

62 Instead, the model parameters are the parameters θ which govern the extension function. [sent-147, score-0.223]

63 This is a significant difference from standard maximum entropy models, and stems from the fact that our overall problem is that of modeling a dynamical system, rather than just density estimation. [sent-148, score-0.236]

64 We will find it more convenient to measure the likelihood of the corresponding $f_t$'s: $p(o_1, o_2, \ldots, o_T) \approx \left[\prod_{t=1}^{T} p(f_t \mid h_t)\right]^{1/n}$. [sent-153, score-0.221]

65 (The likelihoods are not the same because the likelihood of the $f_t$'s counts a single observation $n$ times; the approximate equality is because the first $n$ and last $n$ observations are counted fewer than $n$ times.) [sent-156, score-0.306]

66 The expected log-likelihood of the training ft ’s under the model defined in Eq. [sent-157, score-0.203]
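
A tiny sketch of evaluating this approximate trajectory log-likelihood from the tracked states, as in items 64-66; the 1/n scaling follows the over-counting argument above, and `log_density` is assumed to be a routine like the one sketched earlier (all names are illustrative).

```python
def trajectory_log_likelihood(futures, states, log_density, n):
    """Approximate log p(o_1, ..., o_T): sum of log p(f_t | h_t; s_t) over t,
    scaled by 1/n since each observation appears in up to n overlapping windows."""
    total = sum(log_density(f, s) for f, s in zip(futures, states))
    return total / n
```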

67 Computing this is a standard inference problem in exponential family models, as discussed in Section 5. [sent-164, score-0.254]

68 This gradient tells us that we wish to adjust each state to make the expected features of the next $n$ observations closer to the observed features; however, we cannot adjust $s_t$ directly; instead, we must adjust it implicitly by adjusting the transition parameters A and B. [sent-165, score-0.712]

69 We now compute the gradients of the state with respect to each parameter: $\frac{\partial s_t}{\partial A} = \frac{\partial}{\partial A}\left[ G(o_{t+1})(A s_{t-1} + B) \right] = G(o_{t+1})\left( A \frac{\partial s_{t-1}}{\partial A} + s_{t-1}^\top \otimes I \right)$. [sent-166, score-0.194]

70 The gradients of the state with respect to B are given by $\frac{\partial s_t}{\partial B} = \frac{\partial}{\partial B}\left[ G(o_{t+1})(A s_{t-1} + B) \right] = G(o_{t+1})\left( A \frac{\partial s_{t-1}}{\partial B} + I \right)$. These gradients are temporally recursive – they implicitly depend on gradients from all previous timesteps. [sent-168, score-0.358]
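
An illustrative sketch of accumulating these temporally recursive Jacobians during the forward pass; it uses the identity ∂(As)/∂vec(A) = sᵀ ⊗ I, and the shapes (A ∈ R^{k×l}, B ∈ R^k, G(o) ∈ R^{l×k}) are assumptions made for the example, not taken from the paper.

```python
import numpy as np

def state_and_gradients(s0, observations, A, B, G):
    """Forward pass computing s_t together with the recursive Jacobians
    dS_dA = d s_t / d vec(A)  and  dS_dB = d s_t / d B."""
    k, l = A.shape
    s = s0
    dS_dA = np.zeros((l, k * l))   # Jacobian w.r.t. vec(A), zero at t = 0
    dS_dB = np.zeros((l, k))       # Jacobian w.r.t. B, zero at t = 0
    trajectory = []
    for o in observations:
        Go = G(o)
        # d(A s + B)/dvec(A) = A dS_dA + (s^T kron I_k);  d(A s + B)/dB = A dS_dB + I_k
        dS_dA = Go @ (A @ dS_dA + np.kron(s.reshape(1, -1), np.eye(k)))
        dS_dB = Go @ (A @ dS_dB + np.eye(k))
        s = Go @ (A @ s + B)
        trajectory.append((s, dS_dA, dS_dB))
    return trajectory
```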

71 Likelihoods for naive predictions are shown as a dotted line near the bottom; likelihoods for optimal predictions are shown as a dash-dot line near the top. [sent-180, score-0.123]

72 5 Inference In order to compute the gradients needed for model learning, the expected sufficient statistics E[φ(F n |ht )] at each timestep must be computed (see Eq. [sent-202, score-0.175]

73 For example, each possible set of canonical parameters s induces one set of mean parameters; assuming that the features are linearly independent, each set of valid mean parameters is uniquely determined by one set of canonical parameters [9]. [sent-205, score-0.245]
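
A brute-force sketch of computing these mean parameters E[φ(F^n | h_t)] exactly for a small binary future; in larger models, loopy BP, mean field, or log-determinant relaxations (as discussed below) would stand in for the enumeration. Names and the binary-domain assumption are illustrative.

```python
import itertools
import numpy as np

def expected_features(s, phi, n_vars):
    """Exact mean parameters E[phi(F)] under p(F = f; s), by enumeration.

    Only feasible for small binary futures; approximate inference replaces
    this loop in larger models.
    """
    configs = [np.array(cfg, dtype=float)
               for cfg in itertools.product([0, 1], repeat=n_vars)]
    feats = np.array([phi(c) for c in configs])              # one row per configuration
    logits = feats @ s
    weights = np.exp(logits - np.logaddexp.reduce(logits))   # normalized probabilities
    return weights @ feats                                   # E[phi(F)]
```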

74 In terms of inference, our model inherits all of the properties of graphical models, for better and for worse. [sent-208, score-0.118]

75 The first set tested whether the model could capture exact state, given the correct features and exact inference. [sent-212, score-0.227]

76 We evaluated the learned model using exact inference to compute the exact likelihood of the data, and compared to the true likelihood. [sent-213, score-0.279]

77 The second set tested larger models, for which exact inference is not possible. [sent-214, score-0.119]

78 One objective gauge is control performance: if the model has a reward signal, reinforcement learning can be used to determine an optimal policy. [sent-217, score-0.168]

79 Evaluating the reward achieved becomes an objective measure of model quality, even though approximate likelihood is the learning signal. [sent-218, score-0.166]

80 For each problem, training and test sets were generated (using a uniformly random policy for controlled systems). [sent-229, score-0.128]

81 We then learned the best model possible, and compared the final log-likelihood under the learned and true models. [sent-235, score-0.139]

82 The third panel shows results for a very noisy POMDP, in which the naive and true LLs are very close; this indicates that prediction is difficult, even with a perfect model. [sent-241, score-0.147]

83 The table conveys similar information to the graphs: naive and true log-likelihoods, as well as the loglikelihood of the learned models (on both training and test sets). [sent-243, score-0.206]

84 To help interpret the results, we also report a percentage (highlighted in bold), which indicates the amount of the likelihood gap (between the naive and true models) that was captured by the learned model. [sent-244, score-0.229]

85 We experimented with loopy belief propagation (LBP) [12], naive mean field (or variational mean field, VMF), and log-determinant relaxations (LDR) [10]. [sent-249, score-0.156]

86 The NAC algorithm requires two things: a stochastic, parameterized policy which operates as a function of state, and the gradients of the log probability of that policy. [sent-253, score-0.154]

87 We used a softmax function of a linear projection of the state: the probability of taking action $a_i$ from state $s_t$ given the policy parameters $\theta$ is $p(a_i; s_t, \theta) = \exp\{s_t^\top \theta_i\} \,/\, \sum_{j=1}^{|A|} \exp\{s_t^\top \theta_j\}$. [sent-254, score-0.852]
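
A short sketch of this softmax policy together with the log-probability gradient that a natural actor-critic method would consume; parameter shapes and names are illustrative, with one column of theta per action.

```python
import numpy as np

def softmax_policy(s, theta):
    """Action probabilities p(a_i; s, theta) = exp(s^T theta_i) / sum_j exp(s^T theta_j)."""
    logits = s @ theta
    logits -= logits.max()        # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def log_policy_gradient(s, theta, action):
    """Gradient of log p(a; s, theta) with respect to theta."""
    p = softmax_policy(s, theta)
    grad = -np.outer(s, p)        # -p_j * s for every action column j
    grad[:, action] += s          # plus s for the chosen action
    return grad
```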

88 For comparison, we also ran the NAC planner with the POMDP belief state: we used the same stochastic policy and the same gradients, but we used the belief state of the true POMDP in place of the EFPSR’s state (st ). [sent-256, score-0.497]

89 The left panel shows the best control performance obtained (average reward per timestep) as a function of steps of optimization. [sent-260, score-0.144]

90 The “POMDP” line shows the best reward obtained using the true belief state as computed under the true model, the “Random” line shows the reward obtained with a random policy, and the “Reactive” line shows the best reward obtained by using the observation as input to the NAC algorithm. [sent-261, score-0.528]

91 The EFPSR models all start out with performance equivalent to the random policy (average reward of 0. [sent-267, score-0.206]

92 This is close to the average reward of using the true POMDP state at 0. [sent-270, score-0.247]

93 The EFPSR policy closes about 94% of the gap between a random policy and the policy obtained with the true model. [sent-272, score-0.38]

94 Surprisingly, only a few iterations of optimization were necessary to generate a usable state representation. [sent-273, score-0.133]

95 8% of the gap between a random policy and the optimal policy. [sent-275, score-0.12]

96 We conclude that the EFPSR has learned a model which successfully incorporates information from history into the state representation, and that it is this information which the NAC algorithm uses to obtain better-than-reactive performance. [sent-276, score-0.25]

97 This implies that the model and learning algorithm are useful even with approximate inference methods, and even in cases where we cannot compare to the exact likelihood. [sent-277, score-0.124]

98 7 Conclusions We have presented the Exponential Family PSR, a new model of controlled, stochastic dynamical systems which provably unifies other models with predictively defined state. [sent-278, score-0.208]

99 We were able to learn almost perfect models of several small POMDP systems, both from a likelihood perspective and from a control perspective. [sent-280, score-0.111]

100 While slow, the learning algorithm generates models which can be accurate in terms of likelihood and useful in terms of control performance. [sent-283, score-0.111]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ht', 0.556), ('efpsr', 0.359), ('ot', 0.323), ('st', 0.264), ('ft', 0.17), ('psr', 0.148), ('pomdp', 0.135), ('state', 0.133), ('ll', 0.132), ('dynamical', 0.12), ('nac', 0.11), ('family', 0.103), ('exponential', 0.095), ('policy', 0.093), ('naive', 0.082), ('reward', 0.082), ('ast', 0.074), ('conditioning', 0.072), ('extension', 0.064), ('ldr', 0.063), ('plg', 0.063), ('psrs', 0.063), ('reactive', 0.063), ('vmf', 0.063), ('gradients', 0.061), ('observable', 0.06), ('entropy', 0.059), ('inference', 0.056), ('futures', 0.055), ('maze', 0.055), ('graphical', 0.054), ('features', 0.053), ('pomdps', 0.052), ('likelihood', 0.051), ('graph', 0.051), ('history', 0.047), ('pietra', 0.047), ('lbp', 0.047), ('parameters', 0.046), ('observation', 0.044), ('observations', 0.044), ('capture', 0.043), ('cheesemaze', 0.042), ('closes', 0.042), ('fti', 0.042), ('paint', 0.042), ('rudary', 0.042), ('saying', 0.042), ('wingate', 0.042), ('michigan', 0.042), ('temporally', 0.042), ('carefully', 0.042), ('likelihoods', 0.041), ('belief', 0.041), ('predictive', 0.041), ('dynamics', 0.039), ('rk', 0.039), ('learned', 0.037), ('satinder', 0.037), ('exact', 0.035), ('controlled', 0.035), ('govern', 0.034), ('conjunctions', 0.034), ('model', 0.033), ('emphasize', 0.033), ('propagation', 0.033), ('panel', 0.033), ('true', 0.032), ('adjust', 0.031), ('singh', 0.031), ('inherits', 0.031), ('models', 0.031), ('representations', 0.03), ('density', 0.029), ('control', 0.029), ('tracking', 0.029), ('modeling', 0.028), ('contrastive', 0.028), ('everything', 0.028), ('est', 0.028), ('appeal', 0.028), ('timestep', 0.028), ('tested', 0.028), ('statistics', 0.027), ('canonical', 0.027), ('gap', 0.027), ('atomic', 0.027), ('selecting', 0.026), ('must', 0.026), ('named', 0.026), ('subscripts', 0.026), ('exp', 0.026), ('feature', 0.025), ('loose', 0.025), ('distribution', 0.025), ('reinforcement', 0.024), ('stochastic', 0.024), ('loglikelihood', 0.024), ('inducing', 0.024), ('shifted', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 86 nips-2007-Exponential Family Predictive Representations of State

Author: David Wingate, Satinder S. Baveja

Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1

2 0.25257161 21 nips-2007-Adaptive Online Gradient Descent

Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett

Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between $\sqrt{T}$ and $\log T$. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms. 1

3 0.23432246 102 nips-2007-Incremental Natural Actor-Critic Algorithms

Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton

Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1

4 0.17235518 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

Author: Ambuj Tewari, Peter L. Bartlett

Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1

5 0.14591487 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

Author: Stephane Ross, Joelle Pineau, Brahim Chaib-draa

Abstract: Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ǫ-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. We also provide empirical evidence showing that our approach is also practical, and can find (provably) near-optimal solutions in reasonable time. 1

6 0.13644198 30 nips-2007-Bayes-Adaptive POMDPs

7 0.13415958 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets

8 0.13077877 215 nips-2007-What makes some POMDP problems easy to approximate?

9 0.10944937 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

10 0.10689116 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

11 0.096840389 147 nips-2007-One-Pass Boosting

12 0.094833195 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

13 0.094070494 191 nips-2007-Temporal Difference Updating without a Learning Rate

14 0.093465693 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

15 0.092548504 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

16 0.082704633 162 nips-2007-Random Sampling of States in Dynamic Programming

17 0.08044973 197 nips-2007-The Infinite Markov Model

18 0.077206269 127 nips-2007-Measuring Neural Synchrony by Message Passing

19 0.071075447 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess

20 0.06591706 187 nips-2007-Structured Learning with Approximate Inference


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.224), (1, -0.258), (2, 0.047), (3, -0.064), (4, -0.049), (5, 0.003), (6, 0.002), (7, -0.031), (8, 0.095), (9, -0.103), (10, 0.068), (11, 0.101), (12, -0.117), (13, -0.0), (14, -0.035), (15, -0.035), (16, 0.092), (17, -0.008), (18, -0.113), (19, 0.172), (20, -0.129), (21, 0.065), (22, 0.076), (23, 0.094), (24, -0.066), (25, 0.237), (26, -0.048), (27, -0.01), (28, 0.099), (29, 0.032), (30, 0.143), (31, -0.003), (32, -0.002), (33, -0.051), (34, -0.022), (35, 0.034), (36, -0.042), (37, 0.062), (38, 0.009), (39, -0.026), (40, -0.133), (41, -0.055), (42, 0.013), (43, -0.051), (44, -0.057), (45, 0.137), (46, 0.084), (47, 0.001), (48, -0.024), (49, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93189245 86 nips-2007-Exponential Family Predictive Representations of State

Author: David Wingate, Satinder S. Baveja

Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1

2 0.6715095 191 nips-2007-Temporal Difference Updating without a Learning Rate

Author: Marcus Hutter, Shane Legg

Abstract: We derive an equation for temporal difference learning from statistical principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so called TD(λ), however it lacks the parameter α that specifies the learning rate. In the place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins’ Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter. 1

3 0.65074283 102 nips-2007-Incremental Natural Actor-Critic Algorithms

Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton

Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1

4 0.55236495 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

Author: Alan S. Willsky, Erik B. Sudderth, Martin J. Wainwright

Abstract: Variational methods are frequently used to approximate or bound the partition or likelihood function of a Markov random field. Methods based on mean field theory are guaranteed to provide lower bounds, whereas certain types of convex relaxations provide upper bounds. In general, loopy belief propagation (BP) provides often accurate approximations, but not bounds. We prove that for a class of attractive binary models, the so–called Bethe approximation associated with any fixed point of loopy BP always lower bounds the true likelihood. Empirically, this bound is much tighter than the naive mean field bound, and requires no further work than running BP. We establish these lower bounds using a loop series expansion due to Chertkov and Chernyak, which we show can be derived as a consequence of the tree reparameterization characterization of BP fixed points. 1

5 0.49573901 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets

Author: Joseph K. Bradley, Robert E. Schapire

Abstract: We study boosting in the filtering setting, where the booster draws examples from an oracle instead of using a fixed training set and so may train efficiently on very large datasets. Our algorithm, which is based on a logistic regression technique proposed by Collins, Schapire, & Singer, requires fewer assumptions to achieve bounds equivalent to or better than previous work. Moreover, we give the first proof that the algorithm of Collins et al. is a strong PAC learner, albeit within the filtering setting. Our proofs demonstrate the algorithm’s strong theoretical properties for both classification and conditional probability estimation, and we validate these results through extensive experiments. Empirically, our algorithm proves more robust to noise and overfitting than batch boosters in conditional probability estimation and proves competitive in classification. 1

6 0.49554408 127 nips-2007-Measuring Neural Synchrony by Message Passing

7 0.494856 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

8 0.4895868 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

9 0.46199471 30 nips-2007-Bayes-Adaptive POMDPs

10 0.44355249 21 nips-2007-Adaptive Online Gradient Descent

11 0.44162613 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

12 0.40949267 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess

13 0.38417891 215 nips-2007-What makes some POMDP problems easy to approximate?

14 0.35219693 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

15 0.35084054 197 nips-2007-The Infinite Markov Model

16 0.33666274 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

17 0.33427542 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

18 0.31629768 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

19 0.31107301 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

20 0.30138338 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.051), (8, 0.223), (9, 0.02), (13, 0.081), (16, 0.029), (18, 0.027), (21, 0.078), (31, 0.015), (34, 0.037), (35, 0.024), (47, 0.105), (49, 0.013), (83, 0.109), (85, 0.037), (87, 0.016), (90, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81980741 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents

Author: Yee W. Teh, Daniel J. Hsu, Hal Daume

Abstract: We introduce a new Bayesian model for hierarchical clustering based on a prior over trees called Kingman’s coalescent. We develop novel greedy and sequential Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We show experimentally the superiority of our algorithms over the state-of-the-art, and demonstrate our approach in document clustering and phylolinguistics. 1

same-paper 2 0.7860741 86 nips-2007-Exponential Family Predictive Representations of State

Author: David Wingate, Satinder S. Baveja

Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1

3 0.6409173 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.

4 0.6379537 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1

5 0.63463157 84 nips-2007-Expectation Maximization and Posterior Constraints

Author: Kuzman Ganchev, Ben Taskar, João Gama

Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1

6 0.63423634 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

7 0.63383967 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

8 0.63324666 100 nips-2007-Hippocampal Contributions to Control: The Third Way

9 0.63156909 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

10 0.63082427 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

11 0.62938201 69 nips-2007-Discriminative Batch Mode Active Learning

12 0.62738591 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

13 0.62491852 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

14 0.62491757 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

15 0.62386501 24 nips-2007-An Analysis of Inference with the Universum

16 0.62360305 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

17 0.62316298 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

18 0.62310827 115 nips-2007-Learning the 2-D Topology of Images

19 0.62268782 158 nips-2007-Probabilistic Matrix Factorization

20 0.61936319 134 nips-2007-Multi-Task Learning via Conic Programming