nips nips2011 nips2011-131 knowledge-graph by maker-knowledge-mining

131 nips-2011-Inference in continuous-time change-point models


Source: pdf

Author: Florian Stimberg, Manfred Opper, Guido Sanguinetti, Andreas Ruttor

Abstract: We consider the problem of Bayesian inference for continuous-time multi-stable stochastic systems which can change both their diffusion and drift parameters at discrete times. We propose exact inference and sampling methodologies for two specific cases where the discontinuous dynamics is given by a Poisson process and a two-state Markovian switch. We test the methodology on simulated data, and apply it to two real data sets in finance and systems biology. Our experimental results show that the approach leads to valid inferences and non-trivial insights.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We consider the problem of Bayesian inference for continuous-time multi-stable stochastic systems which can change both their diffusion and drift parameters at discrete times. [sent-10, score-0.696]

2 We propose exact inference and sampling methodologies for two specific cases where the discontinuous dynamics is given by a Poisson process and a two-state Markovian switch. [sent-11, score-0.33]

3 1 Introduction: Continuous-time stochastic models play a prominent role in many scientific fields, from biology to physics to economics. [sent-14, score-0.135]

4 While it is often easy to simulate from a stochastic model, it is often hard to solve inference or parameter estimation problems, or to assess quantitatively the fit of a model to observations. [sent-15, score-0.195]

5 In recent years this has motivated an increasing interest in the machine learning and statistics community in Bayesian inference approaches for stochastic dynamical systems, with applications ranging from biology [1–3] to genetics [4] to spatio-temporal systems [5]. [sent-16, score-0.287]

6 In this paper, we are interested in modelling and inference for systems exhibiting multi-stable behavior. [sent-17, score-0.184]

7 Very common in physical and biological sciences, they are also highly relevant in economics and finance, where unexpected events can trigger sudden changes in trading behavior [6]. [sent-19, score-0.238]

8 While there have been a number of approaches to Bayesian change-point inference [7–9], most of them expect the observations to be independent and to come directly from the change-point process. [sent-20, score-0.114]

9 We present both an exact and an MCMC-based approach for Bayesian inference in multi-stable stochastic systems. [sent-23, score-0.246]

10 We describe in detail two specific scenarios: the classic change-point process scenario, whereby the latent process takes a new value at each jump, and a bistable scenario, where the latent process is a stochastic telegraph process. [sent-24, score-0.658]

11 2 The generative model: We consider a system of N stochastic differential equations (SDEs) of the Ornstein-Uhlenbeck type, dxi = (Ai(t) − λi xi) dt + σi(t) dWi (1), for i = 1, . . . , N. [sent-27, score-0.125]

12 The time dependencies in the drift Ai(t) and in the diffusion terms σi(t) will account for sudden changes in the system and will be further modelled by stochastic Markov jump processes. [sent-31, score-0.748]

13 Our prior assumption is that change points, where Ai and σi change their values, constitute Poisson events. [sent-32, score-0.459]

14 This means that the times ∆t between consecutive change points are independent exponentially distributed random variables with density p(∆t) = f exp(−f ∆t), where f denotes their expected number per time unit. [sent-33, score-0.291]
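
As a concrete illustration of this prior, the following minimal sketch (a hypothetical helper, not from the paper) draws change-point times on [0, T] by accumulating exponential waiting times with rate f:

```python
import numpy as np

def sample_change_points(f, T, rng=None):
    """Draw change-point times on [0, T] from a homogeneous Poisson
    process with rate f (expected number of jumps per time unit):
    waiting times between jumps are i.i.d. Exponential(f)."""
    rng = rng or np.random.default_rng()
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / f)  # numpy parameterizes by scale = 1/f
        if t >= T:
            return np.array(times)
        times.append(t)
```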

15 We will consider two different models for the values of Ai and σi in this paper: • Model 1 assumes that at each of the change points Ai and σi are drawn independently from fixed prior densities pA(·) and pσ(·). [sent-34, score-0.328]

16 The number of change points up to time t is counted by the Poisson process µ(t), so that Ai(t) = Ai^(µ(t)) and σi(t) = σi^(µ(t)) are piecewise constant functions of time. [sent-35, score-0.368]

17 • Model 2: We select the parameters according to the telegraph process µ(t), which switches between µ = 0 and µ = 1 at each change point. [sent-37, score-0.436]

18 We assume that yj = x(tj) + ξj, with independent Gaussian noise ξj ∼ N(0, σo). [sent-50, score-0.126]
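
Putting equation (1), the change-point prior, and the observation model together, a minimal generative sketch of model 1 in one dimension could look as follows; the priors pA and pσ below are arbitrary stand-ins, not the paper's choices:

```python
import numpy as np

def simulate_model1(T=1000.0, dt=0.1, f=0.005, lam=0.02,
                    sigma_obs=0.1, n_obs=100, seed=0):
    """Simulate a 1-D Ornstein-Uhlenbeck path whose drift level A and
    diffusion sigma are redrawn at Poisson change points, then return
    noisy observations y_j = x(t_j) + xi_j."""
    rng = np.random.default_rng(seed)
    # change-point times: exponential waiting times with rate f
    cps, t = [], rng.exponential(1.0 / f)
    while t < T:
        cps.append(t)
        t += rng.exponential(1.0 / f)
    cps = np.array(cps)
    # piecewise-constant parameters: one independent draw per segment
    n_seg = len(cps) + 1
    A = rng.normal(0.0, 1.0, size=n_seg)      # stand-in for pA
    sigma = rng.gamma(2.0, 0.5, size=n_seg)   # stand-in for psigma
    # Euler-Maruyama discretization of dx = (A - lam*x) dt + sigma dW
    n_steps = int(T / dt)
    ts = np.arange(n_steps) * dt
    seg = np.searchsorted(cps, ts)            # active segment at each step
    x = np.empty(n_steps)
    x[0] = A[0] / lam                         # start at the stationary mean
    for k in range(1, n_steps):
        s = seg[k - 1]
        x[k] = (x[k - 1] + (A[s] - lam * x[k - 1]) * dt
                + sigma[s] * np.sqrt(dt) * rng.normal())
    # noisy observations at n_obs evenly spaced times
    idx = np.linspace(0, n_steps - 1, n_obs).astype(int)
    y = x[idx] + sigma_obs * rng.normal(size=n_obs)
    return ts, x, ts[idx], y, cps
```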

19 3 Bayesian Inference: Given data Y, we are interested in the posterior distribution of all unobserved quantities, which are the paths of the stochastic processes X ≡ x[0:T], Z ≡ (A[0:T], σ[0:T]) in a time interval [0 : T] and the model parameters Λ = {λi}. [sent-51, score-0.425]

20 Since the data likelihood p(Y|X) is also Gaussian, it is possible to integrate out the process X analytically, leading to a marginal posterior p(Z|Y, Λ) ∝ p(Y|Z, Λ) p(Z) (3) over the simpler piecewise constant sample paths of the jump processes. [sent-57, score-0.578]

21 When inference on the posterior values of X is required, we can use the fact that X|Y, Z, Λ is an inhomogeneous Ornstein-Uhlenbeck process, which allows for explicit analytical computation of marginal means and variances at each time. [sent-59, score-0.316]
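
For intuition, between change points each component is a plain OU process whose marginal mean and variance have closed forms. The helper below (our own, covering only the per-segment forward moments, not the full posterior smoothing the paper performs) makes those formulas explicit:

```python
import numpy as np

def ou_segment_moments(m0, v0, A, sigma, lam, t):
    """Mean and variance of dx = (A - lam*x) dt + sigma dW after time t,
    starting from x(0) ~ N(m0, v0):
      m(t) = A/lam + (m0 - A/lam) exp(-lam t)
      v(t) = sigma^2/(2 lam) + (v0 - sigma^2/(2 lam)) exp(-2 lam t)"""
    decay = np.exp(-lam * t)
    m_inf = A / lam
    v_inf = sigma ** 2 / (2.0 * lam)
    return m_inf + (m0 - m_inf) * decay, v_inf + (v0 - v_inf) * decay ** 2
```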

22 The jump processes Z = {τ, Θ} are completely determined by the set of change points τ ≡ {τj} and the actual values Θ ≡ {Aj, σj} to which the system jumps at the change points. [sent-60, score-0.859]

23 Since p(Z) = p(Θ|τ) p(τ) and p(Θ|τ, Y, Λ) ∝ p(Y|Z, Λ) p(Θ|τ), we can see that, conditioned on a set of, say, m change points, the distribution of Θ is finite (and usually relatively low) dimensional, so one can draw samples from it using standard methods. [sent-61, score-0.211]

24 In fact, if the prior density pA of the drift values is Gaussian, then it is easy to see that the posterior is also Gaussian. [sent-62, score-0.27]

25 4 MCMC sampler architecture: We use a Metropolis-within-Gibbs sampler, which alternates between sampling the parameters Λ and Θ from p(Λ|Y, τ, Θ) and p(Θ|Y, τ, Λ), and the positions τ of the change points from p(τ|Y, Θ, Λ). [sent-63, score-0.425]

26 Sampling from p(Λ|Y, τ, Θ), as well as sampling the σi's from p(Θ|Y, τ, Λ), is done by a Gaussian random walk Metropolis-Hastings sampler on the logarithm of the parameters, to ensure positivity. [sent-64, score-0.134]
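
A log-space random-walk update of this kind could be sketched as follows (step size and interface are our own; log_target stands for the unnormalized log posterior of the positive parameter):

```python
import numpy as np

def log_rw_mh_step(theta, log_target, step=0.1, rng=None):
    """One Gaussian random-walk Metropolis-Hastings update on log(theta).
    Proposing in log space keeps theta positive; the change of variables
    contributes a Jacobian factor theta to the target density."""
    rng = rng or np.random.default_rng()
    log_theta = np.log(theta)
    log_prop = log_theta + step * rng.normal(size=np.shape(theta))
    prop = np.exp(log_prop)
    log_alpha = (log_target(prop) + np.sum(log_prop)
                 - log_target(theta) - np.sum(log_theta))
    if np.log(rng.uniform()) < log_alpha:
        return prop
    return theta
```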

27 Finally, we need to draw change points from their density p(τ|Y, Θ, Λ) ∝ p(Y|Z, Λ) p(Θ|τ) p(τ). [sent-66, score-0.291]

28 The normal distribution is truncated at the neighboring jump times to ensure that the order of jump times stays the same. [sent-78, score-0.396]
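
A sketch of such a move proposal (helper name and proposal width are hypothetical; note that the Metropolis-Hastings ratio must also account for the truncated-normal normalization constants, which are omitted here):

```python
import numpy as np

def propose_moved_jump(taus, j, T, width=5.0, rng=None):
    """Propose a new time for change point j from a normal centered at
    its current time, truncated to the gap between its neighbours so the
    ordering of jump times is preserved (simple rejection sampling)."""
    rng = rng or np.random.default_rng()
    lo = taus[j - 1] if j > 0 else 0.0
    hi = taus[j + 1] if j + 1 < len(taus) else T
    while True:
        t_new = taus[j] + width * rng.normal()
        if lo < t_new < hi:
            return t_new
```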

29 • Adding a change point: We use a uniform distribution over the whole time interval [0 : T ] to draw the time of the added jump. [sent-79, score-0.276]

30 For model 2 it is randomly decided if the telegraph process µ(t) is inverted before or after the new change point. [sent-81, score-0.436]

31 This is necessary to allow µ to change on both ends. [sent-82, score-0.211]

32 • Removing a change point: The change point to remove is chosen at random. [sent-83, score-0.422]

33 For model 1 the newly joined interval inherits the parameters with equal probability from the interval before or after the removed change point. [sent-84, score-0.341]

34 As for adding a change point, when using model 2 we choose to invert µ either before or after the removed jump time. [sent-85, score-0.451]

35 For model 2 we also need the option to add or remove two jumps, because adding or removing one jump will result in inverting the whole process after or before it, which leads to poor acceptance rates. [sent-86, score-0.359]

36 When adding or removing two jumps instead, µ only changes between these two jumps. [sent-87, score-0.175]

37 • Adding two change points: The first change point is drawn as for adding a single one; the second is drawn uniformly from the interval between the new change point and the next one. [sent-88, score-0.74]

38 • Removing two change points: We choose one of the change points, except the last one, at random and delete it along with the following one. [sent-89, score-0.422]

39 While the proposal does not use any information from the data, it is very fast to compute and quickly converges to reasonable states, although we initialize the change points simply by drawing from p(τ). [sent-90, score-0.345]
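
To make the add/remove bookkeeping concrete for model 1, here is a hypothetical sketch of the two proposals; parameter handling follows the description above, and the reversible-jump acceptance ratios needed to make these valid MCMC moves are omitted:

```python
import numpy as np

def propose_add(taus, thetas, T, draw_theta, rng=None):
    """Add a change point at a time drawn uniformly on [0, T]; the newly
    created segment gets fresh parameters from the prior via draw_theta().
    thetas holds one parameter row per segment (len(taus) + 1 rows)."""
    rng = rng or np.random.default_rng()
    t_new = rng.uniform(0.0, T)
    k = int(np.searchsorted(taus, t_new))       # segment being split
    return (np.insert(taus, k, t_new),
            np.insert(thetas, k + 1, draw_theta(), axis=0))

def propose_remove(taus, thetas, rng=None):
    """Remove a change point chosen uniformly at random; the merged
    segment inherits parameters from one neighbour with equal probability."""
    rng = rng or np.random.default_rng()
    j = int(rng.integers(len(taus)))
    drop = j if rng.uniform() < 0.5 else j + 1  # segment whose params are discarded
    return np.delete(taus, j), np.delete(thetas, drop, axis=0)
```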

40 5 Exact inference: In the case of small systems described by model 2, it is also feasible to calculate the marginal probability distribution q(µ, x, t) for the state variables x, µ at time t of the posterior process directly. [sent-91, score-0.441]

41 For that purpose, we use a smoothing algorithm, which is quite similar to the well-known method for state inference in hidden Markov models. [sent-92, score-0.162]

42 8359 x 200 400 t 600 800 1000 1 10 100 1000 10000 100000 samples Figure 1: Comparison of the results of the MCMC sampler and the exact inference: (top left) True path of x (black) and the noisy observations (blue crosses). [sent-99, score-0.139]

43 (bottom left) True path of µ (black) and posterior of p(µ = 1) from the exact inference (green) and the MCMC sampler (red dashed). [sent-100, score-0.408]

44 Mean difference between the sampler result and the exact inference of p(µ = 1) for different numbers of samples (red crosses), and the result of a power law regression for more than 100 samples (green). [sent-102, score-0.253]
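
The power law regression itself is a one-line fit in log-log space; a minimal sketch (our own helper):

```python
import numpy as np

def power_law_fit(n_samples, err):
    """Fit err ~ c * n^k by least squares on log-log axes and return
    (c, k); an exponent near -0.5 would match Monte Carlo scaling."""
    k, log_c = np.polyfit(np.log(n_samples), np.log(err), 1)
    return np.exp(log_c), k
```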

45 As our model has the Markov property, the exact marginal posterior is given by q(µ, x, t) = (1/L) p(µ, x, t) ψ(µ, x, t). [sent-104, score-0.253]

46 The last factor ψ(µ, x, t) is the likelihood of the observations after time t, under the condition that the process started in state (x, µ) at time t. [sent-107, score-0.125]

47 The initial condition for the forward message p(µ, x, t) is the prior over the initial state of the system. [sent-108, score-0.227]

48 The time evolution of the forward message is given by the forward Chapman-Kolmogorov equation ∂p(µ, x, t)/∂t = −(∂/∂x)[(Aµ − λx) p(µ, x, t)] + (σµ²/2)(∂²/∂x²) p(µ, x, t) + Σν≠µ [fν→µ p(ν, x, t) − fµ→ν p(µ, x, t)] (7). [sent-109, score-0.21]
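
Numerically, equation (7) can be integrated on a grid in x. Below is a crude explicit-Euler sketch for model 2 with two telegraph states; the grid, stencil, and stability handling are simplifying assumptions, not the paper's scheme:

```python
import numpy as np

def forward_ck_step(p, x, dt, A, sigma, lam, f):
    """One explicit Euler step of the forward Chapman-Kolmogorov equation.
    p has shape (2, len(x)), one row per telegraph state mu; A and sigma
    are length-2 arrays, and f[mu, nu] is the switching rate mu -> nu.
    dt must be small enough for the explicit scheme to remain stable."""
    dx = x[1] - x[0]
    p_new = np.empty_like(p)
    for mu in (0, 1):
        nu = 1 - mu
        drift_flux = (A[mu] - lam * x) * p[mu]
        advection = -np.gradient(drift_flux, dx)            # -d/dx[(A - lam x) p]
        diffusion = 0.5 * sigma[mu] ** 2 * np.gradient(
            np.gradient(p[mu], dx), dx)                     # (sigma^2/2) d2p/dx2
        switching = f[nu, mu] * p[nu] - f[mu, nu] * p[mu]   # master-equation terms
        p_new[mu] = p[mu] + dt * (advection + diffusion + switching)
    return p_new
```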

49 By integrating equation (7) forward in time from the first observation to the last, we obtain the exact solution to the filtering problem of our model. [sent-112, score-0.119]

50 Between observations, the time evolution of the backward message is given by the backward Chapman-Kolmogorov equation ∂ψ(µ, x, t)/∂t = −(Aµ − λx)(∂/∂x) ψ(µ, x, t) − (σµ²/2)(∂²/∂x²) ψ(µ, x, t) + Σν≠µ fµ→ν [ψ(µ, x, t) − ψ(ν, x, t)] (10). [sent-115, score-0.196]

51 Each observation is taken into account by the jump condition ψ(µ, x, tj−) = ψ(µ, x, tj+) p(yj|x(tj)) (11). [sent-116, score-0.198]

52 Actual change points are shown as vertical dotted lines. [sent-118, score-0.291]

53 (bottom row) Posterior processes for A (left) and σ² (right) with a one standard deviation confidence interval. [sent-119, score-0.217]

54 Afterwards, L q(µ, x, t) can be calculated by multiplying the forward message p(µ, x, t) and the backward message ψ(µ, x, t). [sent-121, score-0.277]

55 Normalizing that quantity according to Σµ ∫ q(µ, x, t) dx = 1 (12) then gives us the marginal posterior as well as the total likelihood L = p(y1, . . . [sent-122, score-0.202]

56 The availability of an exact solution to the inference problem provides us with an excellent way of monitoring convergence of our sampler. [sent-135, score-0.165]

57 Figure 1 shows the results of sampling on data generated from model 2, with parameter settings such that only the diffusion constant changes, making it a fairly challenging problem. [sent-136, score-0.258]

58 Despite the rather noisy nature of the data (top left panel), the approach gives a reasonable reconstruction of the latent switching process (left panel, bottom). [sent-137, score-0.133]

59 The comparison between exact inference and MCMC is also instructive, showing that the sampled posterior does indeed converge to the true posterior after a relatively short burn-in period (Figure 1, right panel). [sent-138, score-0.534]

60 To test the performance of the inference approach on model 1, we simulated data from a four-dimensional diffusion process with diagonal diffusion, with change points in the drift and diffusion (at the same times). [sent-145, score-1.234]

61 The results of the sampling based inference are shown in Figure 2. [sent-146, score-0.16]

62 Once again, the results indicate that the sampled distribution was able to accurately identify the change points (top right panel) and the values of the parameters (bottom panels). [sent-147, score-0.291]

63 Characterization of noise in stochastic gene expression: Recent developments in microscopy technology have led to the startling discovery that stochasticity plays a crucial role in biology [11]. [sent-151, score-0.428]

64 A currently open question is how to characterize mathematically the difference between intrinsic and extrinsic noise, and a widely mooted opinion is that either the amplitude or the spectral characteristics of the two types of noise should be different [13]. [sent-153, score-0.197]

65 To provide a proof-of-principle investigation into these issues, we tested our model on real stochastic gene expression data subject to extrinsic noise in Bacillus subtilis [14]. [sent-154, score-0.416]

66 Here, single-cell fluorescence levels of the protein comS were assayed through time-lapse microscopy over a period of 36 hours. [sent-155, score-0.169]

67 During this period, the protein was subjected to extrinsic noise in the form of activation of the regulator comK, which controls comS expression with a switch-like behavior (Hill coefficient 5). [sent-156, score-0.409]

68 In both cases we sampled 500,000 posterior samples, discarding an initial burn-in of 10,000 samples. [sent-160, score-0.155]

69 Both models predict two clear change points representing the activation and inactivation of comK at approximately 5 and 23 hrs respectively (Figure 3 right panel, showing model 2 results). [sent-161, score-0.409]

70 Naturally, model 2 predicted two different values for the diffusion constant depending on the activity state of comK (Figure 4, central panel). [sent-163, score-0.26]

71 The two posterior distributions for σ1 and σ2 appear to be well separated, lending support to the unconstrained version of model 2 being a better description. [sent-164, score-0.155]

72 We can gain some insights by considering the underlying discrete dynamics of comS protein counts, which our model approximates as a continuous variable [16]. [sent-179, score-0.159]

73 As we are dealing with bacterial cells, transcription and translation are tightly coupled, so that we can reasonably assume that protein production is given by a Poisson process. [sent-180, score-0.156]

74 At steady state in the absence of comK, the production of comS proteins will be given by a birth-death process with birth rate b and death rate λ, while in the presence of comK the birth rate would change to A + b. [sent-181, score-0.54]

75 Defining ρ0 = b/λ and ρ1 = (A + b)/λ (13), this simple birth-death model implies a Poisson distribution of the steady state comS protein levels in the two comK states, with parameters ρ0 and ρ1 respectively. [sent-182, score-0.24]

76 Unfortunately, we only measure the counts of comS protein up to a proportionality constant (due to the arbitrary units of fluorescence); this means that the basic property of Poisson distributions of having the same mean and variance cannot be tested easily. [sent-183, score-0.117]

77 The birth-death model thus predicts ((A + b)/σ2) / (b/σ1) = √((A + b)/b) (14). This relationship is not enforced in our model, but, if the simple birth-death interpretation is supported by the data, it should emerge naturally in the posterior distributions. [sent-185, score-0.155]

78 To test this, we plot in the right panel of Figure 4 the posterior distribution of f(A, b, σ1, σ2) = ((A + b)/σ2) / (b/σ1) − √((A + b)/b) (15), the difference between the posterior estimate of the ratio of the signal-to-noise ratios in the two comK states and the prediction from the birth-death model. [sent-186, score-0.485]
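
Given posterior samples of (A, b, σ1, σ2), equation (15) is straightforward to evaluate; a minimal sketch, assuming the reconstruction of (15) above:

```python
import numpy as np

def birth_death_discrepancy(A, b, sigma1, sigma2):
    """Equation (15) over arrays of posterior samples: the sampled ratio
    of signal-to-noise ratios between the two comK states minus the
    sqrt((A + b)/b) predicted by the birth-death model. Posterior mass
    away from zero argues against the birth-death interpretation."""
    snr_ratio = ((A + b) / sigma2) / (b / sigma1)
    return snr_ratio - np.sqrt((A + b) / b)
```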

79 The overwhelming majority of the posterior probability mass is away from zero, indicating that the data does not support the predictions of the birth-death interpretation of the steady states. [sent-187, score-0.23]

80 A possible explanation of this unexpected result is that the continuous approximation breaks down in the low abundance state (corresponding to no comK activation); the expected number of particles in the comK inactive state is given by ρ0 and has posterior mean 25. [sent-188, score-0.289]

81 The breaking down of the OU approximation for these levels of protein expression would be surprising, and would sound a call for caution when using SDEs to model single cell data as advocated in large parts of the literature [2]. [sent-190, score-0.178]

82 Our results would then predict that comK regulation at the transcriptional level alone cannot explain the data, and that comS dynamics must be regulated both transcriptionally and post-transcriptionally. [sent-193, score-0.198]

83 The data, shown in Figure 5, consists of monthly closing values; we subsampled it at quarterly values. [sent-197, score-0.138]

84 The posterior processes for A and σ are shown in the central and right panels of Figure 5 respectively. [sent-198, score-0.217]

85 An inspection of these results reveals several interesting change points which can be related to known events: for convenience, we highlight a few of them in the central panel of Figure 5. [sent-199, score-0.399]

86 Clearly evident are the changes caused by the introduction of the Neuer Markt (the German equivalent of the NASDAQ) in 1997, as well as the dot-com bubble (and subsequent recession) in the early 2000s and the global financial crisis in 2008. [sent-200, score-0.156]

87 Interestingly, in our results the diffusion (or volatility, as it is more commonly termed in financial modelling) seems not to be particularly affected by recent events (after surging for the Neuer Markt). [sent-201, score-0.315]

88 A possible explanation is the rather long time interval between data points: volatility is expected to be particularly high on the micro-time scale, or at best the daily scale. [sent-202, score-0.113]

89 7 Discussion: In this paper, we proposed a Bayesian approach to inference in multi-stable systems. [sent-204, score-0.114]

90 The basic model is a system of SDEs whose drift and diffusion coefficients can change abruptly at random, exponentially distributed times. [sent-205, score-0.545]

91 We describe the approach in two special models: a system of SDEs whose coefficients change at the points of a Poisson process (model 1) and a system of SDEs whose coefficients switch between two sets of values according to a random telegraph process (model 2). [sent-206, score-0.892]

92 Each model is particularly suitable for specific applications: while model 1 is important in financial modelling and industrial application, model 2 extends a number of similar models already employed in systems biology [3,15,17]. [sent-207, score-0.124]

93 A new implementation in C++ for model 2 ran over 12 times faster on a data set with 10 OU processes and 2 telegraph processes. [sent-212, score-0.21]

94 While the inference scheme we propose is practical in many situations, scaling to higher dimensional problems may become computationally intensive. [sent-215, score-0.114]

95 It would therefore be interesting to investigate approximate inference solutions like the ones presented in [15]. [sent-216, score-0.114]

96 Another interesting direction would be to extend the current work to a factorial design; these can be important, particularly in biological applications where multiple factors can interact in determining gene expression [17, 18]. [sent-217, score-0.222]

97 Finally, our models are naturally non-parametric in the sense that the number of change points is not a priori determined. [sent-218, score-0.291]

98 Efficient Bayesian inference for multiple change-point and mixture innovation models. [sent-249, score-0.157]

99 An excitable gene regulatory circuit induces transient cellular differentiation. [sent-287, score-0.2]

100 Large scale learning of combinatorial transcriptional dynamics from gene expression. [sent-311, score-0.273]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('comk', 0.384), ('coms', 0.266), ('diffusion', 0.212), ('change', 0.211), ('jump', 0.198), ('guido', 0.156), ('posterior', 0.155), ('ruttor', 0.148), ('telegraph', 0.148), ('dax', 0.118), ('transcriptional', 0.118), ('protein', 0.117), ('inference', 0.114), ('gene', 0.113), ('panel', 0.108), ('ai', 0.107), ('financial', 0.106), ('fluorescence', 0.104), ('poisson', 0.099), ('manfred', 0.097), ('extrinsic', 0.094), ('sanguinetti', 0.089), ('sdes', 0.089), ('tj', 0.088), ('sampler', 0.088), ('stochastic', 0.081), ('points', 0.08), ('competence', 0.078), ('drift', 0.078), ('process', 0.077), ('steady', 0.075), ('message', 0.074), ('finance', 0.072), ('opper', 0.071), ('activation', 0.07), ('modelling', 0.07), ('forward', 0.068), ('noise', 0.067), ('interval', 0.065), ('mcmc', 0.064), ('andreas', 0.062), ('processes', 0.062), ('paths', 0.062), ('expression', 0.061), ('backward', 0.061), ('yj', 0.059), ('bubble', 0.059), ('burn', 0.059), ('crisis', 0.059), ('markt', 0.059), ('nasdaq', 0.059), ('neuer', 0.059), ('sde', 0.059), ('sudden', 0.059), ('switching', 0.056), ('events', 0.055), ('tu', 0.055), ('biology', 0.054), ('proposal', 0.054), ('crosses', 0.053), ('jumps', 0.053), ('microscopy', 0.052), ('recession', 0.052), ('year', 0.051), ('exact', 0.051), ('german', 0.049), ('state', 0.048), ('biological', 0.048), ('monthly', 0.048), ('hrs', 0.048), ('cellular', 0.048), ('volatility', 0.048), ('closing', 0.048), ('marginal', 0.047), ('sampling', 0.046), ('ou', 0.045), ('birth', 0.045), ('intensity', 0.045), ('system', 0.044), ('solid', 0.043), ('bayesian', 0.043), ('dynamics', 0.042), ('edward', 0.042), ('subsampled', 0.042), ('berlin', 0.042), ('removing', 0.042), ('adding', 0.042), ('dashed', 0.039), ('regulatory', 0.039), ('production', 0.039), ('integrate', 0.039), ('changes', 0.038), ('simulated', 0.038), ('dynamical', 0.038), ('regulation', 0.038), ('unexpected', 0.038), ('modelled', 0.038), ('prior', 0.037), ('bioinformatics', 0.037), ('intrinsic', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 131 nips-2011-Inference in continuous-time change-point models

Author: Florian Stimberg, Manfred Opper, Guido Sanguinetti, Andreas Ruttor

Abstract: We consider the problem of Bayesian inference for continuous-time multi-stable stochastic systems which can change both their diffusion and drift parameters at discrete times. We propose exact inference and sampling methodologies for two specific cases where the discontinuous dynamics is given by a Poisson process and a two-state Markovian switch. We test the methodology on simulated data, and apply it to two real data sets in finance and systems biology. Our experimental results show that the approach leads to valid inferences and non-trivial insights.

2 0.17734407 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

Author: Alex K. Susemihl, Ron Meir, Manfred Opper

Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point processes, however. We present an exact analysis of the evolution of the mean-squared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing.

3 0.11198165 101 nips-2011-Gaussian process modulated renewal processes

Author: Yee W. Teh, Vinayak Rao

Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets.

4 0.10426886 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

Author: Philipp Hennig

Abstract: The exploration-exploitation trade-off is among the central challenges of reinforcement learning. The optimal Bayesian solution is intractable in general. This paper studies to what extent analytic statements about optimal learning are possible if all beliefs are Gaussian processes. A first order approximation of learning of both loss and dynamics, for nonlinear, time-varying systems in continuous time and space, subject to a relatively weak restriction on the dynamics, is described by an infinite-dimensional partial differential equation. An approximate finite-dimensional projection gives an impression for how this result may be helpful.

5 0.099032521 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

Author: Dylan A. Simon, Nathaniel D. Daw

Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence, especially in humans, as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rule-based vs. incremental learning. 1
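A minimal, hedged simulation of the noise/volatility trade-off this abstract describes: a TD-style exponential tracker V <- V + alpha * (r - V) follows a reward whose latent mean drifts (volatility) and is observed with noise. All parameter values are illustrative and are not the authors' experimental design; the point is only that high volatility and low noise favor a high learning rate.

```python
import numpy as np

rng = np.random.default_rng(1)

def tracking_mse(volatility, noise, alpha, T=20000):
    """MSE of a TD-style tracker V <- V + alpha * (r - V) following a latent
    reward mean that drifts as a Gaussian random walk and is observed noisily."""
    mu = V = err = 0.0
    for _ in range(T):
        mu += volatility * rng.standard_normal()  # volatile latent mean
        r = mu + noise * rng.standard_normal()    # noisy reward sample
        V += alpha * (r - V)                      # incremental (model-free) update
        err += (V - mu) ** 2
    return err / T

for vol, noi in [(0.01, 1.0), (0.1, 1.0), (0.1, 0.1)]:
    mses = {a: tracking_mse(vol, noi, a) for a in (0.02, 0.1, 0.5)}
    best = min(mses, key=mses.get)
    print(f"volatility={vol}, noise={noi}: best alpha={best} (MSE={mses[best]:.3f})")
```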

6 0.09727446 221 nips-2011-Priors over Recurrent Continuous Time Processes

7 0.095399335 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes

8 0.091419727 100 nips-2011-Gaussian Process Training with Input Noise

9 0.089238875 219 nips-2011-Predicting response time and error rates in visual search

10 0.086189307 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

11 0.082911111 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

12 0.081664845 8 nips-2011-A Model for Temporal Dependencies in Event Streams

13 0.080938548 302 nips-2011-Variational Learning for Recurrent Spiking Networks

14 0.079683945 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

15 0.079137839 301 nips-2011-Variational Gaussian Process Dynamical Systems

16 0.076554887 258 nips-2011-Sparse Bayesian Multi-Task Learning

17 0.073269166 217 nips-2011-Practical Variational Inference for Neural Networks

18 0.072228514 24 nips-2011-Active learning of neural response functions with Gaussian processes

19 0.071946055 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

20 0.071590133 75 nips-2011-Dynamical segmentation of single trials from population neural data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.208), (1, 0.028), (2, 0.122), (3, -0.017), (4, -0.059), (5, -0.128), (6, 0.012), (7, -0.096), (8, 0.033), (9, 0.05), (10, -0.031), (11, -0.104), (12, 0.065), (13, -0.085), (14, 0.001), (15, 0.063), (16, 0.029), (17, 0.01), (18, -0.037), (19, -0.031), (20, 0.03), (21, 0.01), (22, -0.035), (23, -0.029), (24, -0.092), (25, 0.016), (26, -0.096), (27, 0.049), (28, 0.022), (29, -0.036), (30, -0.039), (31, 0.091), (32, -0.109), (33, -0.058), (34, -0.098), (35, -0.065), (36, 0.013), (37, 0.051), (38, -0.026), (39, 0.036), (40, 0.024), (41, -0.0), (42, -0.069), (43, 0.036), (44, 0.036), (45, 0.051), (46, -0.022), (47, 0.109), (48, 0.033), (49, -0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95440614 131 nips-2011-Inference in continuous-time change-point models

Author: Florian Stimberg, Manfred Opper, Guido Sanguinetti, Andreas Ruttor

Abstract: We consider the problem of Bayesian inference for continuous-time multi-stable stochastic systems which can change both their diffusion and drift parameters at discrete times. We propose exact inference and sampling methodologies for two specific cases where the discontinuous dynamics is given by a Poisson process and a two-state Markovian switch. We test the methodology on simulated data, and apply it to two real data sets in finance and systems biology. Our experimental results show that the approach leads to valid inferences and non-trivial insights. 1
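For readers who want to simulate the generative model sketched in this abstract, here is a minimal Euler-Maruyama sketch of an Ornstein-Uhlenbeck-type diffusion whose drift and diffusion parameters are redrawn at Poisson-distributed change points. The priors used for the redrawn parameters and all numerical constants below are placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
T, dt = 10.0, 1e-3
lam, f = 1.0, 0.5  # OU decay rate; expected number of change points per time unit

# Poisson change points: i.i.d. exponential inter-event times with mean 1/f.
change_points, t = [], rng.exponential(1.0 / f)
while t < T:
    change_points.append(t)
    t += rng.exponential(1.0 / f)

def draw_params():
    # Placeholder priors for the post-jump drift A and diffusion sigma.
    return rng.normal(0.0, 2.0), rng.gamma(2.0, 0.5)

n = int(T / dt)
x = np.zeros(n)
A, sigma = draw_params()
k = 0
for i in range(1, n):
    if k < len(change_points) and i * dt >= change_points[k]:
        A, sigma = draw_params()  # both parameters jump at the change point
        k += 1
    # Euler-Maruyama step for dx = (A - lam * x) dt + sigma dW
    x[i] = x[i-1] + (A - lam * x[i-1]) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
```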

2 0.81304628 221 nips-2011-Priors over Recurrent Continuous Time Processes

Author: Ardavan Saeedi, Alexandre Bouchard-Côté

Abstract: We introduce the Gamma-Exponential Process (GEP), a prior over a large family of continuous time stochastic processes. A hierarchical version of this prior (HGEP; the Hierarchical GEP) yields a useful model for analyzing complex time series. Models based on HGEPs display many attractive properties: conjugacy, exchangeability and closed-form predictive distribution for the waiting times, and exact Gibbs updates for the time scale parameters. After establishing these properties, we show how posterior inference can be carried out efficiently using Particle MCMC methods [1]. This yields an MCMC algorithm that can resample entire sequences atomically while avoiding the complications of introducing the slice and stick auxiliary variables of the beam sampler [2]. We applied our model to the problem of estimating the disease progression in multiple sclerosis [3], and to RNA evolutionary modeling [4]. In both domains, we found that our model outperformed the standard rate matrix estimation approach. 1

3 0.80148578 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

Author: Alex K. Susemihl, Ron Meir, Manfred Opper

Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1
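The paper's exact point-process analysis is involved; as a simpler, hedged analogue of tracking the evolution of the mean-squared error, this sketch integrates the classical Kalman-Bucy Riccati equation for an OU state observed in continuous Gaussian noise. It illustrates the kind of error dynamics studied, not the paper's point-process result; all parameters are made up.

```python
# OU state dx = -gamma * x dt + sigma dW, observed as dy = x dt + sqrt(R) dV.
gamma, sigma, R = 1.0, 0.5, 0.1
dt, T = 1e-3, 5.0
S = sigma ** 2 / (2 * gamma)  # start from the stationary (prior) variance
for _ in range(int(T / dt)):
    # Kalman-Bucy Riccati equation: dS/dt = -2*gamma*S + sigma^2 - S^2 / R
    S += dt * (-2 * gamma * S + sigma ** 2 - S ** 2 / R)
print(f"steady-state filtering MSE ~ {S:.4f}")
```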

4 0.78746927 101 nips-2011-Gaussian process modulated renewal processes

Author: Yee W. Teh, Vinayak Rao

Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these methods on a number of synthetic and real datasets. 1
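A small sketch in the spirit of the sampling trick mentioned here: thinning (a close relative of the uniformization idea) draws exact samples from an inhomogeneous Poisson process whenever the intensity can be bounded from above. The sinusoidal intensity below stands in for a Gaussian-process draw and is purely illustrative.

```python
import numpy as np

def sample_modulated_poisson(intensity, bound, T, rng):
    """Thinning (Lewis-Shedler): draw candidates from a homogeneous Poisson
    process at rate `bound` >= intensity(t), keep each with prob intensity/bound."""
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / bound)
        if t > T:
            return np.array(events)
        if rng.uniform() < intensity(t) / bound:
            events.append(t)

rng = np.random.default_rng(3)
# Illustrative smooth intensity; the paper's setting would use a (transformed) GP draw.
events = sample_modulated_poisson(lambda t: 2.0 + 1.5 * np.sin(t), bound=3.5,
                                  T=20.0, rng=rng)
print(f"{len(events)} events on [0, 20]")
```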

5 0.74374151 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes

Author: Yee W. Teh, Charles Blundell, Lloyd Elliott

Abstract: We propose a novel class of Bayesian nonparametric models for sequential data called fragmentation-coagulation processes (FCPs). FCPs model a set of sequences using a partition-valued Markov process which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label-switching non-identifiability problems. We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data. 1
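Since the FCP's equilibrium distribution is stated to be the Chinese restaurant process, a quick CRP sampler is a useful reference point; the concentration parameter and item count below are arbitrary.

```python
import numpy as np

def crp_partition(n, alpha, rng):
    """Sample a partition of n items from the Chinese restaurant process:
    item i joins existing cluster c w.p. |c| / (i + alpha), a new cluster
    w.p. alpha / (i + alpha)."""
    assignments, sizes = [], []
    for i in range(n):
        probs = np.array(sizes + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments

rng = np.random.default_rng(4)
print(crp_partition(20, alpha=1.0, rng=rng))
```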

6 0.70859218 8 nips-2011-A Model for Temporal Dependencies in Event Streams

7 0.63939756 55 nips-2011-Collective Graphical Models

8 0.63357246 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

9 0.60646206 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

10 0.55551702 301 nips-2011-Variational Gaussian Process Dynamical Systems

11 0.55447501 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference

12 0.54438055 104 nips-2011-Generalized Beta Mixtures of Gaussians

13 0.52275127 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data

14 0.51457077 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

15 0.50663435 75 nips-2011-Dynamical segmentation of single trials from population neural data

16 0.49859437 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

17 0.49271202 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo

18 0.49085772 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization

19 0.49055794 132 nips-2011-Inferring Interaction Networks using the IBP applied to microRNA Target Prediction

20 0.48747289 285 nips-2011-The Kernel Beta Process


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.022), (4, 0.042), (20, 0.018), (26, 0.025), (31, 0.148), (33, 0.013), (43, 0.055), (45, 0.073), (57, 0.045), (65, 0.011), (74, 0.052), (83, 0.032), (84, 0.355), (99, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78345883 131 nips-2011-Inference in continuous-time change-point models

Author: Florian Stimberg, Manfred Opper, Guido Sanguinetti, Andreas Ruttor

Abstract: We consider the problem of Bayesian inference for continuous-time multi-stable stochastic systems which can change both their diffusion and drift parameters at discrete times. We propose exact inference and sampling methodologies for two specific cases where the discontinuous dynamics is given by a Poisson process and a two-state Markovian switch. We test the methodology on simulated data, and apply it to two real data sets in finance and systems biology. Our experimental results show that the approach leads to valid inferences and non-trivial insights. 1

2 0.77965748 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data

Author: Bo Chen, David E. Carlson, Lawrence Carin

Abstract: Nonparametric Bayesian methods are developed for analysis of multi-channel spike-train data, with the feature learning and spike sorting performed jointly and simultaneously across all channels. Dictionary learning is implemented via the beta-Bernoulli process, with spike sorting performed via the dynamic hierarchical Dirichlet process (dHDP), with these two models coupled. The dHDP is augmented to eliminate refractory-period violations; it allows the “appearance” and “disappearance” of neurons over time, and it models smooth variation in the spike statistics. 1
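For intuition about the beta-Bernoulli dictionary prior, here is one common finite approximation (assuming the usual Beta(alpha/K, 1) weights; the paper's construction may differ in detail): each of K candidate features gets a usage probability, and each observation switches features on independently.

```python
import numpy as np

def finite_beta_bernoulli(n, K, alpha, rng):
    """Finite approximation to the beta-Bernoulli process: feature k gets a
    usage probability pi_k ~ Beta(alpha/K, 1), and each of n observations
    uses feature k independently with probability pi_k (Z is n x K binary)."""
    pi = rng.beta(alpha / K, 1.0, size=K)
    Z = rng.uniform(size=(n, K)) < pi
    return pi, Z

rng = np.random.default_rng(5)
pi, Z = finite_beta_bernoulli(n=100, K=50, alpha=5.0, rng=rng)
print("features actually used:", int(Z.any(axis=0).sum()))
```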

3 0.74332136 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors

Author: Yangqing Jia, Trevor Darrell

Abstract: Many applications in computer vision measure the similarity between images or image patches based on some statistics such as oriented gradients. These are often modeled implicitly or explicitly with a Gaussian noise assumption, leading to the use of the Euclidean distance when comparing image descriptors. In this paper, we show that the statistics of gradient based image descriptors often follow a heavy-tailed distribution, which undermines any principled motivation for the use of Euclidean distances. We advocate for the use of a distance measure based on the likelihood ratio test with appropriate probabilistic models that fit the empirical data distribution. We instantiate this similarity measure with the Gamma-compound-Laplace distribution, and show significant improvement over existing distance measures in the application of SIFT feature matching, at relatively low computational cost. 1
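To make the likelihood idea concrete, here is a deliberately simplified distance that scores descriptor differences under an i.i.d. Laplace model instead of the Gaussian implicit in squared Euclidean distance. The paper's Gamma-compound-Laplace model is heavier-tailed still; this one-parameter sketch only illustrates the principle.

```python
import numpy as np

def laplace_nll_distance(u, v, b=1.0):
    """Negative log-likelihood 'distance' of the descriptor difference under an
    i.i.d. Laplace noise model, p(d_i) proportional to exp(-|d_i| / b): a simple
    heavy-tailed alternative to the (Gaussian-motivated) squared Euclidean distance."""
    return np.sum(np.abs(u - v)) / b

u, v = np.random.default_rng(6).standard_normal((2, 128))  # SIFT-like 128-d vectors
print("Laplace NLL distance:", laplace_nll_distance(u, v))
print("squared Euclidean:   ", float(np.sum((u - v) ** 2)))
```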

4 0.73263061 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance

Author: Carsten Rother, Martin Kiefel, Lumin Zhang, Bernhard Schölkopf, Peter V. Gehler

Abstract: We address the challenging task of decoupling material properties from lighting properties given a single image. In the last two decades virtually all works have concentrated on exploiting edge information to address this problem. We take a different route by introducing a new prior on reflectance, that models reflectance values as being drawn from a sparse set of basis colors. This results in a Random Field model with global, latent variables (basis colors) and pixel-accurate output reflectance values. We show that without edge information high-quality results can be achieved, that are on par with methods exploiting this source of information. Finally, we are able to improve on state-of-the-art results by integrating edge information into our model. We believe that our new approach is an excellent starting point for future developments in this field. 1

5 0.51918584 281 nips-2011-The Doubly Correlated Nonparametric Topic Model

Author: Dae I. Kim, Erik B. Sudderth

Abstract: Topic models are learned via a statistical model of variation within document collections, but designed to extract meaningful semantic structure. Desirable traits include the ability to incorporate annotations or metadata associated with documents; the discovery of correlated patterns of topic usage; and the avoidance of parametric assumptions, such as manual specification of the number of topics. We propose a doubly correlated nonparametric topic (DCNT) model, the first model to simultaneously capture all three of these properties. The DCNT models metadata via a flexible, Gaussian regression on arbitrary input features; correlations via a scalable square-root covariance representation; and nonparametric selection from an unbounded series of potential topics via a stick-breaking construction. We validate the semantic structure and predictive performance of the DCNT using a corpus of NIPS documents annotated by various metadata. 1
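The stick-breaking construction mentioned at the end of this abstract can be written in a few lines; the truncation level K and concentration alpha below are arbitrary illustrative choices.

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking construction: beta_k = v_k * prod_{j<k} (1 - v_j)
    with v_k ~ Beta(1, alpha) yields (truncated) topic weights summing to < 1."""
    v = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

rng = np.random.default_rng(7)
weights = stick_breaking(alpha=2.0, K=30, rng=rng)
print(weights[:5], "sum =", weights.sum())
```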

6 0.51751649 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals

7 0.51682925 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

8 0.51608032 75 nips-2011-Dynamical segmentation of single trials from population neural data

9 0.51497751 285 nips-2011-The Kernel Beta Process

10 0.51223665 221 nips-2011-Priors over Recurrent Continuous Time Processes

11 0.50748682 219 nips-2011-Predicting response time and error rates in visual search

12 0.50632125 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

13 0.5028609 301 nips-2011-Variational Gaussian Process Dynamical Systems

14 0.50261265 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

15 0.50103974 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

16 0.50081664 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices

17 0.49954018 101 nips-2011-Gaussian process modulated renewal processes

18 0.49939781 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

19 0.49621028 86 nips-2011-Empirical models of spiking in neural populations

20 0.49029344 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability