nips nips2011 nips2011-37 knowledge-graph by maker-knowledge-mining

37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes


Source: pdf

Author: Alex K. Susemihl, Ron Meir, Manfred Opper

Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point processes, however. We present an exact analysis of the evolution of the mean-squared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Bayesian filtering of stochastic stimuli has received a great deal of attention recently. [sent-8, score-0.231]

2 There have been no exact results for the error in the biologically plausible setting of inference on point process, however. [sent-10, score-0.143]

3 We present an exact analysis of the evolution of the mean-squared error in a state estimation task using Gaussian-tuned point processes as sensors. [sent-11, score-0.53]

4 This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. [sent-12, score-0.197]

5 We find that there is an optimal tuning width for which the error is minimized. [sent-14, score-0.294]

6 This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. [sent-15, score-0.323]

7 Here, we concentrate on the problem of Bayesian filtering of stochastic processes. [sent-18, score-0.163]

8 There have been many studies on filtering of stimuli by biological systems [1, 2, 3]; however, there are very few analytical results regarding the error of Bayesian filtering. [sent-19, score-0.271]

9 We provide exact expressions for the evolution of the Mean Squared Error (MSE) of Bayesian filtering for a class of Gaussian processes. [sent-20, score-0.299]

10 Results for expected errors of Gaussian processes had so far been obtained only for the problem of smoothing, where predictions are not online but have to be made using past and future observations [4, 5]. [sent-21, score-0.28]

11 The present work seeks to give an account of the error properties in Bayesian filtering of stochastic processes. [sent-22, score-0.185]

12 We start by analysing the case of Markovian processes in section 2. [sent-23, score-0.174]

13 We find a set of filtering equations from which we can derive a differential equation for the expected mean squared error. [sent-24, score-0.405]

14 We present an implicit equation to optimize the encoding scheme in the case of Poisson spike observations. [sent-26, score-0.246]

15 We also provide a full stochastic model of the evolution of the error, which can be solved analytically in a given interval. [sent-27, score-0.339]

16 In section 3 we show an application to optimal population coding in sensory neurons. [sent-29, score-0.419]

17 Our theoretical results contribute to the ongoing research on ecological theories in biological signal processing (e. [sent-32, score-0.18]

18 g., [6]), which argue that the performance of sensory systems can be enhanced by allowing sensors to adapt to the statistics of the environment. [sent-34, score-0.165]

19 More concretely, let X(t) be a stochastic process, and let M ‘sensory’ processes be defined, each of which generates a Poisson point process with a time-dependent rate function λ_m(X(t), t), m = 1, 2, …, M. [sent-41, score-0.435]

20 Such a stochastic process is often referred to as a doubly stochastic point process. [sent-45, score-0.377]

21 In a neuroscience context λm (·) represents the tuning function of the m’th sensory cell. [sent-46, score-0.342]

22 In order to maintain analytic tractability we focus in this work on a Gaussian form for λ_m, given by λ_m(X(t), t) = φ exp(−(X(t) − θ_m)²/(2α(t)²)), where θ_m are the tuning function centers. [sent-47, score-0.141]

23 We will assume the tuning function centers are equally spaced with spacing ∆θ, for simplicity, although this is not essential to our arguments. [sent-48, score-0.144]

24 Though the rate of observations for the individual processes depends on the instantaneous value of the process, it can be shown that under certain assumptions the total rate of observations (the rate at which observations by all processes are generated) is independent of the process. [sent-49, score-0.771]

25 If we assume that the processes are independent and assume that the probability of the stimulus falling outside the range spanned by the tuning function centers is negligible, we obtain the total rate of observations λ(t) = Σ_m λ_m(X(t), t) = φ Σ_m exp(−(X(t) − θ_m)²/(2α(t)²)) ≈ √(2π) φ α(t)/Δθ, which is independent of X(t). [sent-50, score-0.598]
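
To make the near-constancy of the total rate concrete, the short sketch below sums the Gaussian tuning curves numerically and compares the result with √(2π)φα/Δθ; the numerical values of φ, α and Δθ are illustrative choices, not taken from the paper.

```python
import numpy as np

phi, alpha, dtheta = 10.0, 0.5, 0.1             # illustrative values
theta = np.arange(-5.0, 5.0 + dtheta, dtheta)   # tuning-function centres theta_m

def total_rate(x):
    # lambda(t) = sum_m phi * exp(-(x - theta_m)^2 / (2 alpha^2))
    return phi * np.exp(-(x - theta) ** 2 / (2.0 * alpha ** 2)).sum()

xs = np.linspace(-2.0, 2.0, 9)                  # stimuli well inside the covered range
print([round(total_rate(x), 3) for x in xs])    # roughly constant in x
print(np.sqrt(2.0 * np.pi) * phi * alpha / dtheta)  # analytic value sqrt(2*pi)*phi*alpha/dtheta
```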

26 Denoting the set of observations generated by the sensory processes by ξ = {(t_i, m_i, Θ_i)}, we have the probability of a given set of observations ξ given a stimulus history X_[t0,tf]: P(ξ | X_[t0,tf]) = exp(−Σ_m ∫_{t0}^{tf} λ_m(X(t), t) dt) Π_i λ_{m_i}(X(t_i), t_i) = exp(−∫_{t0}^{tf} λ(t) dt) Π_i λ_{m_i}(X(t_i), t_i). [sent-52, score-1.092]

27 The equations involved are Gaussian and evaluating them we obtain the usual Gaussian process regression equations (see [13] and [14, p. [sent-57, score-0.307]

28 17]) μ(t, ξ) = Σ_{i,j} K(t − t_i) C⁻¹_{ij} Θ_j,   s(t, ξ) = K(0) − Σ_{i,j} K(t − t_i) C⁻¹_{ij} K(t_j − t),  (1)  where K(t − t′) is the auto-correlation function, or kernel, of the Gaussian process X(t). [sent-58, score-0.428]

29 Here the time t_i denotes the time of the i-th observation, m_i gives the identity of the sensor making the observation, and Θ_i = θ_{m_i} is the mean of the Gaussian rate function. [sent-60, score-0.313]
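
A minimal sketch of Eq. (1) follows. The section does not spell out the matrix C, so the sketch assumes C_ij = K(t_i − t_j) + α²δ_ij (the kernel Gram matrix at the observation times with the squared tuning width on the diagonal), and uses the Ornstein–Uhlenbeck kernel introduced below; all numerical values are illustrative.

```python
import numpy as np

def ou_kernel(tau, gamma=1.0, eta=1.0):
    # Ornstein-Uhlenbeck kernel K(tau) = eta^2/(2 gamma) * exp(-gamma |tau|)
    return eta ** 2 / (2.0 * gamma) * np.exp(-gamma * np.abs(tau))

def filter_posterior(t, obs_times, obs_centres, alpha, kernel=ou_kernel):
    """Posterior mean mu(t, xi) and variance s(t, xi) of Eq. (1)."""
    obs_times = np.asarray(obs_times, dtype=float)
    obs_centres = np.asarray(obs_centres, dtype=float)
    # assumed: C_ij = K(t_i - t_j) + alpha^2 delta_ij
    C = kernel(obs_times[:, None] - obs_times[None, :]) + alpha ** 2 * np.eye(obs_times.size)
    k = kernel(t - obs_times)                  # vector of K(t - t_i)
    w = np.linalg.solve(C, k)                  # C^{-1} K(t - t_i)
    mu = w @ obs_centres                       # sum_ij K(t - t_i) C^{-1}_ij Theta_j
    s = kernel(0.0) - w @ k                    # K(0) - sum_ij K(t - t_i) C^{-1}_ij K(t_j - t)
    return mu, s

mu, s = filter_posterior(2.0, obs_times=[0.3, 1.1, 1.8], obs_centres=[0.2, -0.1, 0.4], alpha=0.5)
print(mu, s)   # s depends only on the observation times, not on the Theta_i
```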

30 This is the minimal mean-squared error of the optimal Bayesian estimator X̂(t; ξ) = ⟨X(t)⟩_{X(t)|ξ} with respect to a mean-squared error loss function. [sent-62, score-0.206]

31 Note that the posterior variance is independent of the value of the observations, depending solely on the observation times. [sent-65, score-0.228]

32 If we make a Markov assumption about the structure of the kernel K(t − t′) we are able to make statements about the evolution of the posterior variance between observations. [sent-68, score-0.397]

33 This allows us to derive the differential Chapman-Kolmogorov equation [15] for the evolution of the posterior variance and then obtain the evolution of the MMSE. [sent-69, score-0.933]

34 For the Ornstein–Uhlenbeck process dX(t) = −γX(t)dt + η dW(t) we have the kernel K(τ) = (η²/2γ) e^{−γ|τ|} and the differential equation for the evolution of the posterior variance between observations (see [16, p. [sent-70, score-0.975]

35 ds(t)/dt = η² − 2γ s(t). (2) When a new observation arrives, the distribution is updated through Bayes’ rule. [sent-72, score-0.275]

36 P(X(t) | (t, θ_i)) = N( (α²(t)μ(t) + s(t)θ_i)/(α²(t) + s(t)),  α²(t)s(t)/(α²(t) + s(t)) ). (3) Here, as before, the posterior variance is independent of the specific observation θ_i, therefore we need only concentrate on the times of observations for purposes of modelling the posterior variance. [sent-74, score-0.505]

37 The evolution of the posterior variance is a Markov process which is driven by a deterministic drift, given in Eq. [sent-75, score-0.52]

38 This continuous-time stochastic process is defined by a transition probability which in the limit of infinitesimal time dt → 0 is given by P(s′, t + dt | s, t) = (1 − λ(t)dt) δ(s′ − s − (η² − 2γs)dt) + λ(t)dt δ(s′ − α(t)²s/(α(t)² + s)). (4) [sent-78, score-0.462]

39 In the equation above, the first term accounts for the drift given in Eq. (2), [sent-79, score-0.267]

40 and the second term accounts for the jumps given by Eq. (3). [sent-80, score-0.169]
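
The drift of Eq. (2) and the jump of Eq. (3) can be put together in a few lines of simulation; the sketch below draws one sample path of the posterior variance s(t), with illustrative parameter values (they are not the paper's).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, eta, alpha = 1.0, 1.0, 0.5     # illustrative parameters
lam = 5.0                             # total observation rate, e.g. sqrt(2*pi)*phi*alpha/dtheta
dt, T = 1e-3, 50.0

s = eta ** 2 / (2.0 * gamma)          # start at the prior equilibrium variance
trace = np.empty(int(T / dt))
for k in range(trace.size):
    s += (eta ** 2 - 2.0 * gamma * s) * dt            # drift between observations, Eq. (2)
    if rng.random() < lam * dt:                       # an observation arrives
        s = alpha ** 2 * s / (alpha ** 2 + s)         # Bayes update of the variance, Eq. (3)
    trace[k] = s

print(trace.mean())                   # time-averaged error along this sample path
```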

41 ∂P(s,t)/∂t = −∂/∂s[(η² − 2γs)P(s,t)] + λ(t)[(α²/(α² − s))² P(α²s/(α² − s), t) − P(s,t)]. (5) This equation is, however, too complicated to be solved exactly in the general case. [sent-84, score-0.207]

42 We can use it to derive the evolution of statistical averages by noting that d⟨f(s)⟩/dt = ∫ ds f(s) ∂P(s,t)/∂t. [sent-85, score-0.176]

43 For f(s) = s we obtain an exact equation for the evolution of the average error. [sent-86, score-0.664]

44 d⟨s⟩/dt = η² − 2γ⟨s⟩ − λ(t) ⟨s²/(α(t)² + s)⟩_{P(s,t)}. (6) Mean field approximation: We will now derive a good closed-form approximate equation for the expected posterior variance ⟨s⟩ from (6). [sent-89, score-0.39]

45 d⟨s⟩_mf/dt = η² − 2γ⟨s⟩_mf − λ(t) ⟨s⟩_mf²/(α(t)² + ⟨s⟩_mf). (7) This approximation works remarkably well, giving an excellent account of the equilibrium regime and of the relaxation of the error, as can be seen in Fig. [sent-92, score-0.689]
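
A sketch of how Eq. (7) can be used in practice: integrate the mean-field equation to its fixed point and scan the tuning width α for the value that minimises the equilibrium error, using λ(α) = √(2π)φα/Δθ as above; all numbers are illustrative.

```python
import numpy as np

gamma, eta, phi, dtheta = 1.0, 1.0, 10.0, 0.5   # illustrative parameters

def equilibrium_error(alpha, dt=1e-3, T=50.0):
    lam = np.sqrt(2.0 * np.pi) * phi * alpha / dtheta    # observation rate for this tuning width
    s = eta ** 2 / (2.0 * gamma)
    for _ in range(int(T / dt)):                         # relax Eq. (7) to its fixed point
        s += (eta ** 2 - 2.0 * gamma * s - lam * s ** 2 / (alpha ** 2 + s)) * dt
    return s

alphas = np.linspace(0.05, 2.0, 40)
errors = np.array([equilibrium_error(a) for a in alphas])
print(alphas[np.argmin(errors)], errors.min())           # finite optimal tuning width
```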

46 The dependence on the maximal observation rate φ is quite trivial, as an increase in φ increases the effect of observations linearly. [sent-95, score-0.18]

47 This is interesting as it accounts for a sharpening of the Gaussian rates when the error is small and a broadening when the error is large. [sent-98, score-0.305]

48 2 Exact results for the stationary distribution: We will now assume that both λ and α are time independent, so that the stochastic process converges to a stationary state described by ∂P(s,t)/∂t = 0. [sent-100, score-0.385]

49 Hence the differential Chapman–Kolmogorov equation (specialised to the stationary state) is simply −(d/dz)[γz(2 − z)P(z)] + λP(z − δ) − λP(z) = 0. (9) Viewing z as a temporal variable, we can treat Eq. [sent-105, score-0.462]

50 (9) as a delay differential equation which depends on P at previous values of z. [sent-106, score-0.35]

51 Eq. (9) would, however, become a simple ordinary linear differential equation with a known inhomogeneity P(z − δ) in the interval z0 ≤ z ≤ z0 + δ, which could be solved explicitly by numerical quadrature. [sent-108, score-0.448]

52 Since jumps can only increase z and since also ż(t) > 0 for z < 2, we find that in the stationary state the interval 0 ≤ z < 2 will become depopulated. [sent-111, score-0.206]

53 Hence, for 2 ≤ z ≤ 2 + δ we have −(d/dz)[γz(2 − z)P(z)] = λP(z), which is solved by P(z) ∝ z⁻²(1 − 2/z)^{−1+λ/2γ}. Transforming back to the original error variable s yields P_eq(s) ∝ (η² − 2γs)^{λ/2γ − 1}. [sent-112, score-0.191]

54 This is a very interesting result, as it shows a diverging behaviour of the equilibrium distribution at s = η²/2γ for values of λ < 2γ. [sent-114, score-0.24]

55 When the average time between observations τ_obs = 1/λ is larger than the relaxation time of the process’ variance τ_var = 1/(2γ), the most probable value for the error will be the equilibrium variance of the observed process, η²/(2γ). [sent-117, score-0.665]
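
The two regimes can be checked directly from the functional form of P_eq. The sketch below evaluates (η² − 2γs)^{λ/2γ−1} close to s = η²/2γ (the branch of the stationary density derived above) for one value of λ below and one above 2γ; parameter values are illustrative.

```python
import numpy as np

gamma, eta = 1.0, 1.0
s_max = eta ** 2 / (2.0 * gamma)                 # equilibrium variance of the observed process
s = np.linspace(0.9 * s_max, s_max - 1e-9, 5)    # approach the upper end of the support

for lam in (0.5, 4.0):                           # lambda < 2*gamma  vs  lambda > 2*gamma
    p = (eta ** 2 - 2.0 * gamma * s) ** (lam / (2.0 * gamma) - 1.0)
    print(lam, p)                                # diverges towards s_max only when lambda < 2*gamma
```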

56 This is expected to hold for small jumps δ (when the system is trivially almost deterministic) and/or for large jump rates λ, when the density of jumps is so large that relative fluctuations are small. [sent-126, score-0.249]

57 Using again a simple mean-field argument as before shows that in such [Figure 1: Comparison of the different regimes for the equilibrium distribution.] [sent-127, score-0.202]

58 situations we find that in equilibrium z should be close to z* = 1 + √(1 + λδ/γ). [sent-135, score-0.202]

59 Linearising also the drift γz(2 − z) around z* yields a Fokker–Planck equation which is equivalent to a simple diffusion process (of the Ornstein–Uhlenbeck type) which is solved by the Gaussian density P(z) = N(1 + √(1 + λδ/γ), λδ²/(4γ√(1 + λδ/γ))). [sent-137, score-0.385]
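
As a rough check of this Gaussian regime, the sketch below simulates the rescaled variable z (drift γz(2 − z), jumps of fixed size δ at rate λ) and compares its sample mean and variance with z* = 1 + √(1 + λδ/γ) and λδ²/(4γ√(1 + λδ/γ)); parameters are illustrative and chosen so that λδ/γ is large.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, lam, delta = 1.0, 50.0, 0.2               # lambda*delta/gamma = 10, the large-density regime
dt, T, burn_in = 1e-4, 100.0, 10.0

z = 2.0
samples = []
for k in range(int(T / dt)):
    z += gamma * z * (2.0 - z) * dt              # deterministic drift
    if rng.random() < lam * dt:
        z += delta                               # jump of fixed size delta
    if k * dt > burn_in:
        samples.append(z)

samples = np.array(samples)
print(samples.mean(), 1.0 + np.sqrt(1.0 + lam * delta / gamma))
print(samples.var(), lam * delta ** 2 / (4.0 * gamma * np.sqrt(1.0 + lam * delta / gamma)))
```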

60 In Fig. 1 we present the different approximations compared to the simulated histograms of the posterior variance. [sent-139, score-0.158]

61 3 Optimal Population Coding: As an application we look into the problem of neural population coding of dynamic stimuli (see [13]). [sent-143, score-0.354]

62 We model the spiking of neurons as doubly stochastic Poisson processes driven by the stimulus X(t); that is, the probability of a given neuron firing a spike in a given interval [t, t + dt] is given by P_t(spike_m | X(t)) = φ exp(−(X(t) − θ_m)²/(2α(t)²)) dt, and P_t(spike | X(t)) = Σ_m P_t(spike_m | X(t)) ≈ (√(2π) φ α(t)/Δθ) dt = λ(t) dt. [sent-144, score-0.693]
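
A sketch of this population model: sample an Ornstein–Uhlenbeck stimulus and generate spikes from the Gaussian tuning curves by Bernoulli thinning of the rates; φ, α, Δθ, γ and η are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, eta = 1.0, 1.0                             # Ornstein-Uhlenbeck stimulus parameters
phi, alpha, dtheta = 10.0, 0.5, 0.25              # tuning parameters (illustrative)
theta = np.arange(-4.0, 4.0 + dtheta, dtheta)     # tuning-curve centres
dt, T = 1e-3, 5.0

x = 0.0
spikes = []                                       # list of (t_i, m_i) pairs
for k in range(int(T / dt)):
    x += -gamma * x * dt + eta * np.sqrt(dt) * rng.standard_normal()      # stimulus X(t)
    rates = phi * np.exp(-(x - theta) ** 2 / (2.0 * alpha ** 2))          # lambda_m(X(t), t)
    for m in np.flatnonzero(rng.random(theta.size) < rates * dt):         # Bernoulli thinning
        spikes.append((k * dt, m))

print(len(spikes), "spikes; expected about",
      round(np.sqrt(2.0 * np.pi) * phi * alpha / dtheta * T))
```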

63 Under these assumptions, the inference from a spike train is equivalent to that on observations of data, and the MMSE follows the differential Eq. [sent-145, score-0.406]

64 Again, the fact that the posterior variance depends solely on the spike times allows us to substitute the spiking processes for each neuron with one spiking process for the whole population, greatly simplifying our calculations. [sent-147, score-0.732]

65 We compare the framework derived with the dynamic population coding presented in [13] in Fig. [sent-148, score-0.278]

66 The mean-field approximation works remarkably well, yielding a relative error smaller than 2% throughout the range of parameters. [sent-155, score-0.17]

67 Figure 2: Neural coding of a second-order Markov process as described in the text. [sent-159, score-0.209]

68 The top figure shows the process overlaid with the posterior mean and confidence intervals. [sent-160, score-0.234]

69 The bottom plot shows the posterior variance of one sample run in black, the average over a thousand runs in blue and the mean-field dynamics in red. [sent-161, score-0.261]

70 The mean-field leads to a very good approximation, and the optimal α for the approximation is a good estimator for the optimal α in the simulation. [sent-165, score-0.141]

71 4 Filtering Smoother Processes: To study the filtering of smoother processes we will look at higher-order Markov processes. [sent-166, score-0.236]

72 We do so by considering a multidimensional stochastic process which is Markovian if we consider all of the components, but restrict ourselves to one component, which will then exhibit a non-Markovian structure. [sent-167, score-0.261]

73 This is done by an extension to the Ornstein–Uhlenbeck process frequently used in the Gaussian process literature, whose correlation structure is given by the Matérn kernel (see below). [sent-168, score-0.276]

74 We have to work with the covariance matrix of the system, since its elements’ dynamics are coupled. [sent-169, score-0.143]

75 We consider a p-th order stochastic process such as a_{p+1} X^(p)(t) + a_p X^(p−1)(t) + ··· + a_1 X(t) = η Z(t), where Z(t) is white Gaussian noise with covariance δ(t − t′) and X^(n)(t) denotes the n-th derivative of X(t). [sent-172, score-0.35]

76 Writing the proper Itô stochastic differential equations we obtain a set of p − 1 first-order differential equations and a single first-order stochastic differential equation, Ẋ_1 = X_2, Ẋ_2 = X_3, …. [sent-173, score-0.966]

77 …, Ẋ_{p−1} = X_p, a_{p+1} dX_p = −Σ_{i=1}^{p} a_i X_i dt + η dW_t, where W_t is the Wiener process. [sent-176, score-0.236]

78 We can control the smoothness of the process X_1(t) with the parameter ν; increasing it yields successively smoother processes (see supplementary information). [sent-179, score-0.396]

79 We can express this as a multidimensional stochastic process by choosing Γ_{i,j} = −δ_{i,j−1} + δ_{i,p} a_j and Σ^{1/2}_{i,j} = δ_{i,p} δ_{j,p} η, where δ_{i,j} is the Kronecker delta. [sent-180, score-0.261]

80 We then have the Itô stochastic differential equation dX(t) = −ΓX(t)dt + Σ^{1/2} dW(t) (11) for X(t)ᵀ = (X_1(t), X_2(t), …, X_p(t)). [sent-181, score-0.454]
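
A sketch of Eq. (11) for a second-order (p = 2) process: assuming a Matérn ν = 3/2 style parameterisation a_1 = κ², a_2 = 2κ, a_3 = 1 (our choice for illustration), build Γ and Σ^{1/2} as defined above and integrate the SDE by Euler–Maruyama.

```python
import numpy as np

rng = np.random.default_rng(2)
kappa, eta = 1.0, 1.0
a = np.array([kappa ** 2, 2.0 * kappa])          # a_1, a_2 (with a_{p+1} = 1)
p = a.size

Gamma = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        # Gamma_{i,j} = -delta_{i,j-1} + delta_{i,p} a_j
        Gamma[i, j] = -float(j == i + 1) + float(i == p - 1) * a[j]
Sigma_half = np.zeros((p, p))
Sigma_half[p - 1, p - 1] = eta                   # Sigma^{1/2}_{i,j} = delta_{i,p} delta_{j,p} eta

dt, T = 1e-3, 10.0
X = np.zeros(p)
x1 = np.empty(int(T / dt))
for k in range(x1.size):
    dW = np.sqrt(dt) * rng.standard_normal(p)
    X = X - Gamma @ X * dt + Sigma_half @ dW     # Eq. (11): dX = -Gamma X dt + Sigma^{1/2} dW
    x1[k] = X[0]                                 # the observed, smoother component X_1(t)

print(x1[:5])
```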

81 This can be solved using the solution of the homogeneous equation, Σ(t) = exp[−tΓ] Σ(0) exp[−tΓᵀ], and the solution to the inhomogeneous equation given by the equilibrium solution. [sent-187, score-0.793]

82 σ′_{i,j} = σ_{i,j} − σ_{1,i} σ_{1,j}/(α² + σ_{1,1}). (13) Putting equations 12 and 13 together we obtain the differential Chapman–Kolmogorov equation for the evolution of the probability of the covariance matrix. [sent-192, score-0.694]

83 With this we obtain the differential equation for the average posterior covariance matrix, d⟨σ_{i,j}⟩/dt = ⟨σ_{i+1,j}⟩ + ⟨σ_{i,j+1}⟩ − Σ_l (δ_{i,p} a_l ⟨σ_{l,j}⟩ + δ_{j,p} a_l ⟨σ_{i,l}⟩) − λ(t) ⟨σ_{1,i} σ_{1,j}/(α(t)² + σ_{1,1})⟩ + η² δ_{i,p} δ_{j,p}, (14) where we abuse notation by using σ_{i,j} = 0 if i > p or j > p. [sent-193, score-0.811]

84 These can be solved in the mean-field approximation to obtain an approximation for the covariance matrix. [sent-194, score-0.278]
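
A sketch of such a mean-field solution for p = 2, written in matrix form: the prior (Lyapunov) drift −ΓΣ − ΣΓᵀ + Σ_noise plus the update of Eq. (13) applied at rate λ. This matrix rewriting is our reading of the index notation above, and all parameter values are illustrative.

```python
import numpy as np

kappa, eta = 1.0, 1.0
phi, alpha, dtheta = 10.0, 0.5, 0.25
lam = np.sqrt(2.0 * np.pi) * phi * alpha / dtheta   # total observation rate

Gamma = np.array([[0.0, -1.0],
                  [kappa ** 2, 2.0 * kappa]])        # as built in the previous sketch
Sigma_noise = np.array([[0.0, 0.0],
                        [0.0, eta ** 2]])            # Sigma^{1/2} Sigma^{1/2 T}

dt, T = 1e-4, 20.0
S = 0.5 * np.eye(2)                                  # rough initial covariance guess
for _ in range(int(T / dt)):
    update = np.outer(S[:, 0], S[0, :]) / (alpha ** 2 + S[0, 0])   # Eq. (13) term
    S = S + (-Gamma @ S - S @ Gamma.T + Sigma_noise - lam * update) * dt

print(S[0, 0])   # equilibrium mean-field MMSE of the observed component X_1(t)
```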

85 With these expressions we can then use the equilibrium conditions for Eq. (14). [Figure 4: MMSE for a second-order stochastic process.] [sent-196, score-0.603]

86 On the left is the color map of the first diagonal element of the covariance matrix for the ν = 3/2 case, corresponding to the variance of the observed stimulus variable and on the right, the same element as a function of α for a few values of φ. [sent-197, score-0.245]

87 In red we show the MMSE for the simulated equilibrium variance for comparison. [sent-200, score-0.325]

88 The dependence of the error on the parameters strongly resembles that of the Ornstein–Uhlenbeck process, showing a finite optimal value of α which minimizes the error given φ. [sent-207, score-0.245]

89 Note that for the second-order process the MMSE relative to the variance of the observed process (MMSE/K(0)) drops to lower values than in the Ornstein-Uhlenbeck process, leading to a better state estimation. [sent-209, score-0.358]

90 5 Discussion: We have shown that the dynamics of the Bayesian state estimation error for Markovian processes can be modelled by a simple dynamic system. [sent-211, score-0.432]

91 This provides insight into generalization properties of Gaussian process inference in an online, causal setting, where previous generalization error calculations [4, 5] for Gaussian processes do not apply. [sent-212, score-0.414]

92 In the context of filtering the usual generalization error calculations do not apply. [sent-213, score-0.151]

93 Furthermore, we have demonstrated that a simple mean-field approximation successfully captures the dynamics of the average error of the described inference framework. [sent-214, score-0.206]

94 One key feature we were able to verify is the existence of an optimal tuning width for Gaussian-tuned Poisson processes which minimizes the MMSE, as has been verified elsewhere for static stimuli ([17, 12, 18]). [sent-216, score-0.463]

95 Future research could concentrate on generalizing the presented framework towards more realistic spike generation models, such as integrate-and-fire neurons. [sent-218, score-0.157]

96 These results provide a promising first step towards a mathematical theory of ecologically grounded sensory processing. [sent-220, score-0.165]

97 A neural network implementing optimal state estimation based on dynamic spike train decoding. [sent-244, score-0.247]

98 Learning curves for Gaussian process regression: Approximations and bounds. [sent-252, score-0.202]

99 Could information theory provide an ecological theory of sensory processing? [sent-257, score-0.238]

100 Error-based analysis of optimal tuning functions explains phenomena observed in sensory neurons. [sent-295, score-0.317]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mmse', 0.403), ('dt', 0.236), ('differential', 0.202), ('equilibrium', 0.202), ('evolution', 0.176), ('processes', 0.174), ('sensory', 0.165), ('equation', 0.148), ('ti', 0.137), ('eld', 0.129), ('population', 0.123), ('process', 0.122), ('ron', 0.115), ('posterior', 0.112), ('tuning', 0.108), ('jumps', 0.106), ('observations', 0.106), ('stochastic', 0.104), ('spike', 0.098), ('stimulus', 0.097), ('meir', 0.094), ('obs', 0.094), ('ltering', 0.092), ('coding', 0.087), ('opt', 0.085), ('berlin', 0.085), ('error', 0.081), ('mf', 0.081), ('gaussian', 0.08), ('variance', 0.077), ('stimuli', 0.076), ('markovian', 0.075), ('ecological', 0.073), ('dynamics', 0.072), ('matern', 0.071), ('susemihl', 0.071), ('covariance', 0.071), ('neuroscience', 0.069), ('biological', 0.069), ('bayesian', 0.069), ('mi', 0.068), ('dynamic', 0.068), ('cij', 0.068), ('accounts', 0.063), ('smoother', 0.062), ('exact', 0.062), ('width', 0.061), ('expressions', 0.061), ('stationary', 0.061), ('poisson', 0.06), ('solved', 0.059), ('concentrate', 0.059), ('filtering', 0.058), ('ito', 0.058), ('drift', 0.056), ('equations', 0.055), ('ou', 0.054), ('approximation', 0.053), ('ap', 0.053), ('neuron', 0.053), ('xp', 0.052), ('great', 0.051), ('dz', 0.051), ('tf', 0.051), ('spiking', 0.048), ('doubly', 0.047), ('dw', 0.047), ('manfred', 0.047), ('simulated', 0.046), ('dynamically', 0.045), ('analytical', 0.045), ('optimal', 0.044), ('tj', 0.043), ('bernstein', 0.043), ('obtain', 0.042), ('sound', 0.041), ('alex', 0.041), ('heidelberg', 0.04), ('uctuations', 0.039), ('universit', 0.039), ('interval', 0.039), ('dependence', 0.039), ('observation', 0.039), ('behaviour', 0.038), ('theories', 0.038), ('rescaling', 0.038), ('smoothness', 0.038), ('state', 0.037), ('calculations', 0.037), ('jump', 0.037), ('remarkably', 0.036), ('centers', 0.036), ('rate', 0.035), ('multidimensional', 0.035), ('sensor', 0.034), ('analytic', 0.033), ('driven', 0.033), ('writing', 0.033), ('usual', 0.033), ('kernel', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

Author: Alex K. Susemihl, Ron Meir, Manfred Opper

Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point processes, however. We present an exact analysis of the evolution of the mean-squared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing.

2 0.22383904 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

Author: Kamiar R. Rad, Liam Paninski

Abstract: Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We find that for large neural populations carrying a finite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufficient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we find that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched firing rates. We argue that our findings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design.

3 0.18692015 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

Author: Philipp Hennig

Abstract: The exploration-exploitation trade-off is among the central challenges of reinforcement learning. The optimal Bayesian solution is intractable in general. This paper studies to what extent analytic statements about optimal learning are possible if all beliefs are Gaussian processes. A first order approximation of learning of both loss and dynamics, for nonlinear, time-varying systems in continuous time and space, subject to a relatively weak restriction on the dynamics, is described by an infinite-dimensional partial differential equation. An approximate finite-dimensional projection gives an impression for how this result may be helpful.

4 0.18590294 302 nips-2011-Variational Learning for Recurrent Spiking Networks

Author: Danilo J. Rezende, Daan Wierstra, Wulfram Gerstner

Abstract: We derive a plausible learning rule for feedforward, feedback and lateral connections in a recurrent network of spiking neurons. Operating in the context of a generative model for distributions of spike sequences, the learning mechanism is derived from variational inference principles. The synaptic plasticity rules found are interesting in that they are strongly reminiscent of experimental Spike Time Dependent Plasticity, and in that they differ for excitatory and inhibitory neurons. A simulation confirms the method’s applicability to learning both stationary and temporal spike patterns. 1

5 0.17734407 131 nips-2011-Inference in continuous-time change-point models

Author: Florian Stimberg, Manfred Opper, Guido Sanguinetti, Andreas Ruttor

Abstract: We consider the problem of Bayesian inference for continuous-time multi-stable stochastic systems which can change both their diffusion and drift parameters at discrete times. We propose exact inference and sampling methodologies for two specific cases where the discontinuous dynamics is given by a Poisson process and a two-state Markovian switch. We test the methodology on simulated data, and apply it to two real data sets in finance and systems biology. Our experimental results show that the approach leads to valid inferences and non-trivial insights. 1

6 0.14254282 24 nips-2011-Active learning of neural response functions with Gaussian processes

7 0.1364769 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

8 0.13297445 219 nips-2011-Predicting response time and error rates in visual search

9 0.11240868 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations

10 0.10723209 86 nips-2011-Empirical models of spiking in neural populations

11 0.10016808 261 nips-2011-Sparse Filtering

12 0.099682271 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis

13 0.098966278 301 nips-2011-Variational Gaussian Process Dynamical Systems

14 0.09784735 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data

15 0.09643431 101 nips-2011-Gaussian process modulated renewal processes

16 0.090415895 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

17 0.089961469 298 nips-2011-Unsupervised learning models of primary cortical receptive fields and receptive field plasticity

18 0.08979588 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

19 0.088183269 249 nips-2011-Sequence learning with hidden units in spiking neural networks

20 0.087640889 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.255), (1, 0.056), (2, 0.275), (3, -0.042), (4, 0.004), (5, -0.029), (6, -0.028), (7, -0.059), (8, 0.01), (9, 0.085), (10, -0.052), (11, -0.062), (12, 0.045), (13, -0.013), (14, 0.014), (15, 0.062), (16, 0.031), (17, 0.035), (18, -0.02), (19, -0.014), (20, -0.027), (21, -0.036), (22, 0.049), (23, -0.054), (24, -0.073), (25, 0.027), (26, -0.165), (27, 0.035), (28, -0.0), (29, -0.027), (30, -0.058), (31, 0.146), (32, -0.072), (33, -0.05), (34, -0.017), (35, -0.019), (36, 0.036), (37, 0.079), (38, -0.025), (39, 0.077), (40, 0.004), (41, 0.06), (42, 0.035), (43, 0.016), (44, 0.103), (45, 0.066), (46, 0.012), (47, -0.021), (48, 0.009), (49, -0.148)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96554649 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

Author: Alex K. Susemihl, Ron Meir, Manfred Opper

Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1
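The setting this abstract describes, a Gaussian stimulus process observed through Poisson spikes from Gaussian tuning curves, with the decoding error depending on the tuning width, can be explored numerically. The sketch below is not the paper's closed-form analysis; it is a Monte Carlo approximation that tracks the stimulus with a simple Gaussian assumed-density filter, updating the belief only at spike times (which relies on a dense, uniform population so that the total firing rate is roughly stimulus-independent). The choice of an Ornstein-Uhlenbeck process as the Markovian stimulus and all parameter values are illustrative assumptions.

```python
# Illustrative Monte Carlo sketch (not the paper's exact analysis): filter an
# Ornstein-Uhlenbeck stimulus from Poisson spikes of Gaussian-tuned sensors and
# measure how the empirical MSE depends on the tuning width `a`.
# All parameter values are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
gamma, eta = 1.0, 1.0            # OU decay rate and noise amplitude: dx = -gamma*x dt + eta dW
phi = 20.0                       # peak firing rate of each tuning curve [spikes/s]
theta = np.linspace(-6, 6, 121)  # preferred stimuli of the population (dense, uniform)
dt, T = 1e-3, 50.0               # time step and simulated duration [s]
steps = int(T / dt)
sigma_eq = eta**2 / (2 * gamma)  # stationary variance of the OU process

def run_mse(a):
    """Empirical MSE of a Gaussian assumed-density filter for tuning width a."""
    x = 0.0                      # true stimulus
    mu, var = 0.0, sigma_eq      # Gaussian belief of the decoder
    err = 0.0
    for _ in range(steps):
        # true OU dynamics (Euler-Maruyama)
        x += -gamma * x * dt + eta * np.sqrt(dt) * rng.standard_normal()
        # prior propagation of the belief (exact for the OU process over one step dt)
        mu *= np.exp(-gamma * dt)
        var = var * np.exp(-2 * gamma * dt) + sigma_eq * (1 - np.exp(-2 * gamma * dt))
        # Poisson spikes from Gaussian tuning curves lambda_m(x) = phi*exp(-(x-theta_m)^2/(2a^2))
        rates = phi * np.exp(-(x - theta) ** 2 / (2 * a**2))
        spikes = rng.random(theta.size) < rates * dt
        # each spike acts like a Gaussian likelihood N(theta_m; x, a^2): conjugate update
        for th in theta[spikes]:
            new_var = 1.0 / (1.0 / var + 1.0 / a**2)
            mu = new_var * (mu / var + th / a**2)
            var = new_var
        err += (x - mu) ** 2
    return err / steps

for a in [0.2, 0.5, 1.0, 2.0, 4.0]:
    print(f"tuning width a = {a:4.1f}   empirical MSE = {run_mse(a):.4f}")
```

Sweeping the tuning width and comparing the printed MSE values should, qualitatively, reproduce the existence of an optimal width: very narrow tuning yields too few spikes, very broad tuning yields uninformative ones.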

2 0.78874087 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

Author: Kamiar R. Rad, Liam Paninski

Abstract: Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We find that for large neural populations carrying a finite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufficient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we find that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched firing rates. We argue that our findings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design.

3 0.76557767 131 nips-2011-Inference in continuous-time change-point models

Author: Florian Stimberg, Manfred Opper, Guido Sanguinetti, Andreas Ruttor

Abstract: We consider the problem of Bayesian inference for continuous-time multi-stable stochastic systems which can change both their diffusion and drift parameters at discrete times. We propose exact inference and sampling methodologies for two specific cases where the discontinuous dynamics is given by a Poisson process and a two-state Markovian switch. We test the methodology on simulated data, and apply it to two real data sets in finance and systems biology. Our experimental results show that the approach leads to valid inferences and non-trivial insights. 1

4 0.74229634 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

Author: Philipp Hennig

Abstract: The exploration-exploitation trade-off is among the central challenges of reinforcement learning. The optimal Bayesian solution is intractable in general. This paper studies to what extent analytic statements about optimal learning are possible if all beliefs are Gaussian processes. A first order approximation of learning of both loss and dynamics, for nonlinear, time-varying systems in continuous time and space, subject to a relatively weak restriction on the dynamics, is described by an infinite-dimensional partial differential equation. An approximate finitedimensional projection gives an impression for how this result may be helpful. 1 Introduction – Optimal Reinforcement Learning Reinforcement learning is about doing two things at once: Optimizing a function while learning about it. These two objectives must be balanced: Ignorance precludes efficient optimization; time spent hunting after irrelevant knowledge incurs unnecessary loss. This dilemma is famously known as the exploration exploitation trade-off. Classic reinforcement learning often considers time cheap; the trade-off then plays a subordinate role to the desire for learning a “correct” model or policy. Many classic reinforcement learning algorithms thus rely on ad-hoc methods to control exploration, such as “ -greedy” [1], or “Thompson sampling” [2]. However, at least since a thesis by Duff [3] it has been known that Bayesian inference allows optimal balance between exploration and exploitation. It requires integration over every possible future trajectory under the current belief about the system’s dynamics, all possible new data acquired along those trajectories, and their effect on decisions taken along the way. This amounts to optimization and integration over a tree, of exponential cost in the size of the state space [4]. The situation is particularly dire for continuous space-times, where both depth and branching factor of the “tree” are uncountably infinite. Several authors have proposed approximating this lookahead through samples [5, 6, 7, 8], or ad-hoc estimators that can be shown to be in some sense close to the Bayes-optimal policy [9]. In a parallel development, recent work by Todorov [10], Kappen [11] and others introduced an idea to reinforcement learning long commonplace in other areas of machine learning: Structural assumptions, while restrictive, can greatly simplify inference problems. In particular, a recent paper by Simpkins et al. [12] showed that it is actually possible to solve the exploration exploitation trade-off locally, by constructing a linear approximation using a Kalman filter. Simpkins and colleagues further assumed to know the loss function, and the dynamics up to Brownian drift. Here, I use their work as inspiration for a study of general optimal reinforcement learning of dynamics and loss functions of an unknown, nonlinear, time-varying system (note that most reinforcement learning algorithms are restricted to time-invariant systems). The core assumption is that all uncertain variables are known up to Gaussian process uncertainty. The main result is a first-order description of optimal reinforcement learning in form of infinite-dimensional differential statements. This kind of description opens up new approaches to reinforcement learning. As an only initial example of such treatments, Section 4 1 presents an approximate Ansatz that affords an explicit reinforcement learning algorithm; tested in some simple but instructive experiments (Section 5). 
An intuitive description of the paper’s results is this: From prior and corresponding choice of learning machinery (Section 2), we construct statements about the dynamics of the learning process (Section 3). The learning machine itself provides a probabilistic description of the dynamics of the physical system. Combining both dynamics yields a joint system, which we aim to control optimally. Doing so amounts to simultaneously controlling exploration (controlling the learning system) and exploitation (controlling the physical system). Because large parts of the analysis rely on concepts from optimal control theory, this paper will use notation from that field. Readers more familiar with the reinforcement learning literature may wish to mentally replace coordinates x with states s, controls u with actions a, dynamics with transitions p(s | s, a) and utilities q with losses (negative rewards) −r. The latter is potentially confusing, so note that optimal control in this paper will attempt to minimize values, rather than to maximize them, as usual in reinforcement learning (these two descriptions are, of course, equivalent). 2 A Class of Learning Problems We consider the task of optimally controlling an uncertain system whose states s ≡ (x, t) ∈ K ≡ RD ×R lie in a D +1 dimensional Euclidean phase space-time: A cost Q (cumulated loss) is acquired at (x, t) with rate dQ/dt = q(x, t), and the first inference problem is to learn this analytic function q. A second, independent learning problem concerns the dynamics of the system. We assume the dynamics separate into free and controlled terms affine to the control: dx(t) = [f (x, t) + g(x, t)u(x, t)] dt (1) where u(x, t) is the control function we seek to optimize, and f, g are analytic functions. To simplify our analysis, we will assume that either f or g are known, while the other may be uncertain (or, alternatively, that it is possible to obtain independent samples from both functions). See Section 3 for a note on how this assumption may be relaxed. W.l.o.g., let f be uncertain and g known. Information about both q(x, t) and f (x, t) = [f1 , . . . , fD ] is acquired stochastically: A Poisson process of constant rate λ produces mutually independent samples yq (x, t) = q(x, t)+ q and yf d (x, t) = fd (x, t)+ fd where q 2 ∼ N (0, σq ); fd 2 ∼ N (0, σf d ). (2) The noise levels σq and σf are presumed known. Let our initial beliefs about q and f be given by D Gaussian processes GP kq (q; µq , Σq ); and independent Gaussian processes d GP kf d (fd ; µf d , Σf d ), respectively, with kernels kr , kf 1 , . . . , kf D over K, and mean / covariance functions µ / Σ. In other words, samples over the belief can be drawn using an infinite vector Ω of i.i.d. Gaussian variables, as 1/2 ˜ fd ([x, t]) = µf d ([x, t])+ 1/2 Σf d ([x, t], [x , t ])Ω(x , t )dx dt = µf d ([x, t])+(Σf d Ω)([x, t]) (3) the second equation demonstrates a compact notation for inner products that will be used throughout. It is important to note that f, q are unknown, but deterministic. At any point during learning, we can use the same samples Ω to describe uncertainty, while µ, Σ change during the learning process. To ensure continuous trajectories, we also need to regularize the control. Following control custom, 1 we introduce a quadratic control cost ρ(u) = 2 u R−1 u with control cost scaling matrix R. Its units [R] = [x/t]/[Q/x] relate the cost of changing location to the utility gained by doing so. 
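To make the problem class concrete, the following sketch simulates the data-generating process just defined: Euler integration of the control-affine dynamics of Eq. (1), with noisy point-wise observations of q and f arriving as a Poisson process of rate lam, as in Eq. (2). The specific functions f, g, q used here are arbitrary stand-ins for illustration, not anything taken from the paper.

```python
# A minimal sketch of the data-generating process of Eqs. (1)-(2): control-affine
# dynamics integrated with Euler steps, and noisy point-wise samples of q and f
# arriving as a Poisson process of rate lam. f, g, q below are stand-ins.
import numpy as np

rng = np.random.default_rng(1)
dt, lam = 0.01, 2.0                       # integration step and sampling rate
sigma_q, sigma_f = 0.1, 0.1               # observation noise levels (Eq. 2)

f = lambda x, t: -0.5 * x                 # unknown free dynamics (here: a simple linear decay)
g = lambda x, t: 1.0                      # known control gain
q = lambda x, t: x**2                     # unknown loss rate

def step(x, t, u):
    """One Euler step of dx = [f + g u] dt, plus possibly one noisy observation."""
    x_new = x + (f(x, t) + g(x, t) * u) * dt
    obs = None
    if rng.random() < lam * dt:           # Poisson process: one sample with prob. lam*dt
        y_q = q(x, t) + sigma_q * rng.standard_normal()
        y_f = f(x, t) + sigma_f * rng.standard_normal()
        obs = (x, t, y_q, y_f)            # datapoint to be fed to the GP beliefs over q and f
    return x_new, obs

x, t = 1.0, 0.0
data = []
for _ in range(1000):
    x, o = step(x, t, u=0.0)              # zero control, just to exercise the generator
    if o is not None:
        data.append(o)
    t += dt
print(f"collected {len(data)} observations in {t:.1f}s of simulated time")
```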
The overall task is to find the optimal discounted horizon value ∞ v(x, t) = min u t 1 e−(τ −t)/γ q[χ[τ, u(χ, τ )], τ ] + u(χ, τ ) R−1 u(χ, τ ) dτ 2 (4) where χ(τ, u) is the trajectory generated by the dynamics defined in Equation (1), using the control law (policy) u(x, t). The exponential definition of the discount γ > 0 gives the unit of time to γ. Before beginning the analysis, consider the relative generality of this definition: We allow for a continuous phase space. Both loss and dynamics may be uncertain, of rather general nonlinear form, and may change over time. The specific choice of a Poisson process for the generation of samples is 2 somewhat ad-hoc, but some measure is required to quantify the flow of information through time. The Poisson process is in some sense the simplest such measure, assigning uniform probability density. An alternative is to assume that datapoints are acquired at regular intervals of width λ. This results in a quite similar model but, since the system’s dynamics still proceed in continuous time, can complicate notation. A downside is that we had to restrict the form of the dynamics. However, Eq. (1) still covers numerous physical systems studied in control, for example many mechanical systems, from classics like cart-and-pole to realistic models for helicopters [13]. 3 Optimal Control for the Learning Process The optimal solution to the exploration exploitation trade-off is formed by the dual control [14] of a joint representation of the physical system and the beliefs over it. In reinforcement learning, this idea is known as a belief-augmented POMDP [3, 4], but is not usually construed as a control problem. This section constructs the Hamilton-Jacobi-Bellman (HJB) equation of the joint control problem for the system described in Sec. 2, and analytically solves the equation for the optimal control. This necessitates a description of the learning algorithm’s dynamics: At time t = τ , let the system be at phase space-time sτ = (x(τ ), τ ) and have the Gaussian process belief GP(q; µτ (s), Στ (s, s )) over the function q (all derivations in this section will focus on q, and we will drop the sub-script q from many quantities for readability. The forms for f , or g, are entirely analogous, with independent Gaussian processes for each dimension d = 1, . . . , D). This belief stems from a finite number N of samples y 0 = [y1 , . . . , yN ] ∈ RN collected at space-times S 0 = [(x1 , t1 ), . . . , (xN , tN )] ≡ [s1 , . . . , sN ] ∈ KN (note that t1 to tN need not be equally spaced, ordered, or < τ ). For arbitrary points s∗ = (x∗ , t∗ ) ∈ K, the belief over q(s∗ ) is a Gaussian with mean function µτ , and co-variance function Στ [15] 2 µτ (s∗ ) = k(s∗ , S 0 )[K(S 0 , S 0 ) + σq I]−1 y 0 i i (5) 2 Στ (s∗ , s∗ ) = k(s∗ , s∗ ) − k(s∗ , S 0 )[K(S 0 , S 0 ) + σq I]−1 k(S 0 , s∗ ) i j i j i j where K(S 0 , S 0 ) is the Gram matrix with elements Kab = k(sa , sb ). We will abbreviate K0 ≡ 2 [K(S 0 , S 0 ) + σy I] from here on. The co-vector k(s∗ , S 0 ) has elements ki = k(s∗ , si ) and will be shortened to k0 . How does this belief change as time moves from τ to τ + dt? If dt 0, the chance of acquiring a datapoint yτ in this time is λ dt. Marginalising over this Poisson stochasticity, we expect one sample with probability λ dt, two samples with (λ dt)2 and so on. 
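Equation (5) is the standard Gaussian process regression posterior; the few lines below are a small numerical illustration of exactly those two formulas with a square-exponential kernel. The kernel parameters, observation locations and targets are invented for the example.

```python
# Numerical check of the posterior formulas in Eq. (5): GP regression with a
# square-exponential kernel. All data and parameters are illustrative only.
import numpy as np

def k_se(A, B, theta=1.0, S=0.5):
    """Square-exponential kernel k(a,b) = theta^2 exp(-(a-b)^2 / (2 S^2)) on scalar inputs."""
    d = A[:, None] - B[None, :]
    return theta**2 * np.exp(-0.5 * d**2 / S**2)

sigma_q = 0.1
S0 = np.array([0.1, 0.4, 0.9, 1.5])                       # observed locations s_i
y0 = np.sin(3 * S0) + sigma_q * np.random.default_rng(2).standard_normal(S0.size)

K0 = k_se(S0, S0) + sigma_q**2 * np.eye(S0.size)          # [K(S0,S0) + sigma_q^2 I]
s_star = np.linspace(0, 2, 5)                             # query points s*
k0 = k_se(s_star, S0)                                     # k(s*, S0)

mu = k0 @ np.linalg.solve(K0, y0)                                 # Eq. (5), posterior mean
Sigma = k_se(s_star, s_star) - k0 @ np.linalg.solve(K0, k0.T)     # Eq. (5), posterior covariance
print(np.round(mu, 3))
print(np.round(np.diag(Sigma), 3))
```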
So the mean after dt is expected to be µτ + dt = λ dt (k0 , kτ ) K0 ξτ ξτ κτ −1 y0 −1 + (1 − λ dt − O(λ dt)2 ) · k0 K0 y 0 + O(λ dt)2 (6) yτ where we have defined the map kτ = k(s∗ , sτ ), the vector ξ τ with elements ξτ,i = k(si , sτ ), and 2 the scalar κτ = k(sτ , sτ ) + σq . Algebraic re-formulation yields −1 −1 −1 −1 µτ + dt = k0 K0 y 0 + λ(kt − k0 K0 ξ t )(κt − ξ t K0 ξ t )−1 (yt − ξ t K0 y 0 ) dt. −1 ξ τ K0 y 0 −1 ξ τ K0 ξ τ ) Note that = µ(sτ ), the mean prediction at sτ and (κτ − ¯ ¯ the marginal variance there. Hence, we can define scalars Σ, σ and write −1 (yτ − ξ τ K0 y 0 ) [Σ1/2 Ω](sτ ) + σω ¯τ = ≡ Σ1/2 Ω + στ ω ¯ −1 1/2 [Σ(sτ , sτ ) + σ 2 ]1/2 (κτ − ξ τ K0 ξ τ ) = 2 σq (7) + Σ(sτ , sτ ), with ω ∼ N (0, 1). (8) So the change to the mean consists of a deterministic but uncertain change whose effects accumulate linearly in time, and a stochastic change, caused by the independent noise process, whose variance accumulates linearly in time (in truth, these two points are considerably subtler, a detailed proof is left out for lack of space). We use the Wiener [16] measure dω to write dµsτ (s∗ ) = λ −1 kτ − k0 K0 ξ τ [Σ1/2 Ω](sτ ) + σω ¯τ dt ≡ λLsτ (s∗ )[Σ1/2 Ω dt + στ dω] ¯ −1 (κτ − ξ τ K0 ξ τ )−1/2 [Σ(sτ , sτ ) + σ 2 ]1/2 (9) where we have implicitly defined the innovation function L. Note that L is a function of both s∗ and sτ . A similar argument finds the change of the covariance function to be the deterministic rate dΣsτ (s∗ , s∗ ) = −λLsτ (s∗ )Lsτ (s∗ ) dt. i j i j 3 (10) So the dynamics of learning consist of a deterministic change to the covariance, and both deterministic and stochastic changes to the mean, both of which are samples a Gaussian processes with covariance function proportional to LL . This separation is a fundamental characteristic of GPs (it is the nonparametric version of a more straightforward notion for finite-dimensional Gaussian beliefs, for data with known noise magnitude). We introduce the belief-augmented space H containing states z(τ ) ≡ [x(τ ), τ, µτ (s), µτ 1 , . . . , µτ D , q f f Στ (s, s ), Στ 1 , . . . , Στ D ]. Since the means and covariances are functions, H is infinite-dimensional. q f f Under our beliefs, z(τ ) obeys a stochastic differential equation of the form dz = [A(z) + B(z)u + C(z)Ω] dt + D(z) dω (11) with free dynamics A, controlled dynamics Bu, uncertainty operator C, and noise operator D A = µτ (zx , zt ) , 1 , 0 , 0 , . . . , 0 , −λLq Lq , −λLf 1 Lf 1 , . . . , −λLf D Lf D ; f B = [g(s∗ ), 0, 0, 0, . . . ]; (12) 1/2 ¯q ¯ 1/2 ¯ 1/2 C = diag(Σf τ , 0, λLq Σ1/2 , λLf 1 Σf 1 , . . . , λLf D Σf d , 0, . . . , 0); D = diag(0, 0, λLq σq , λLf 1 σf 1 , . . . , λLf D σf D , 0, . . . , 0) ¯ ¯ ¯ (13) ∗ The value – the expected cost to go – of any state s is given by the Hamilton-Jacobi-Bellman equation, which follows from Bellman’s principle and a first-order expansion, using Eq. (4): 1 (14) µq (sτ ) + Σ1/2 Ωq + σq ωq + u R−1 u dt + v(zτ + dt ) dω dΩ qτ 2 1 v(zτ ) ∂v 1 µτ +Σ1/2 Ωq + u R−1 u+ + +[A+Bu+CΩ] v+ tr[D ( 2 v)D]dΩ dt q qτ 2 dt ∂t 2 v(zτ ) = min u = min u Integration over ω can be performed with ease, and removes the stochasticity from the problem; The uncertainty over Ω is a lot more challenging. Because the distribution over future losses is correlated through space and time, v, 2 v are functions of Ω, and the integral is nontrivial. But there are some obvious approximate approaches. 
For example, if we (inexactly) swap integration and minimisation, draw samples Ωi and solve for the value for each sample, we get an “average optimal controller”. This over-estimates the actual sum of future rewards by assuming the controller has access to the true system. It has the potential advantage of considering the actual optimal controller for every possible system, the disadvantage that the average of optima need not be optimal for any actual solution. On the other hand, if we ignore the correlation between Ω and v, we can integrate (17) locally, all terms in Ω drop out and we are left with an “optimal average controller”, which assumes that the system locally follows its average (mean) dynamics. This cheaper strategy was adopted in the following. Note that it is myopic, but not greedy in a simplistic sense – it does take the effect of learning into account. It amounts to a “global one-step look-ahead”. One could imagine extensions that consider the influence of Ω on v to a higher order, but these will be left for future work. Under this first-order approximation, analytic minimisation over u can be performed in closed form, and bears u(z) = −RB(z) v(z) = −Rg(x, t) x v(z). (15) The optimal Hamilton-Jacobi-Bellman equation is then 1 1 v − [ v] BRB v + tr D ( 2 v)D . 2 2 A more explicit form emerges upon re-inserting the definitions of Eq. (12) into Eq. (16): γ −1 v(z) = µτ + A q γ −1 v(z) = [µτ + µτ (zx , zt ) q f x + t 1 v(z) − [ 2 x v(z)] free drift cost c=q,f1 ,...,fD x v(z) control benefit − λ Lc Lc + g (zx , zt )Rg(zx , zt ) (16) Σc 1 v(z) + λ2 σc Lf d ( ¯2 2 exploration bonus 2 µf d v(z))Lf d (17) diffusion cost Equation (17) is the central result: Given Gaussian priors on nonlinear control-affine dynamic systems, up to a first order approximation, optimal reinforcement learning is described by an infinitedimensional second-order partial differential equation. It can be interpreted as follows (labels in the 4 equation, note the negative signs of “beneficial” terms): The value of a state comprises the immediate utility rate; the effect of the free drift through space-time and the benefit of optimal control; an exploration bonus of learning, and a diffusion cost engendered by the measurement noise. The first two lines of the right hand side describe effects from the phase space-time subspace of the augmented space, while the last line describes effects from the belief part of the augmented space. The former will be called exploitation terms, the latter exploration terms, for the following reason: If the first two lines line dominate the right hand side of Equation (17) in absolute size, then future losses are governed by the physical sub-space – caused by exploiting knowledge to control the physical system. On the other hand, if the last line dominates the value function, exploration is more important than exploitation – the algorithm controls the physical space to increase knowledge. To my knowledge, this is the first differential statement about reinforcement learning’s two objectives. Finally, note the role of the sampling rate λ: If λ is very low, exploration is useless over the discount horizon. Even after these approximations, solving Equation (17) for v remains nontrivial for two reasons: First, although the vector product notation is pleasingly compact, the mean and covariance functions are of course infinite-dimensional, and what looks like straightforward inner vector products are in fact integrals. 
For example, the average exploration bonus for the loss, writ large, reads ∂v(z) (18) −λLq Lq Σq v(z) = − λL(q) (s∗ )L(q) (s∗ ) ds∗ ds∗ . sτ i sτ j ∂Σ(s∗ , s∗ ) i j K i j (note that this object remains a function of the state sτ ). For general kernels k, these integrals may only be solved numerically. However, for at least one specific choice of kernel (square-exponentials) and parametric Ansatz, the required integrals can be solved in closed form. This analytic structure is so interesting, and the square-exponential kernel so widely used that the “numerical” part of the paper (Section 4) will restrict the choice of kernel to this class. The other problem, of course, is that Equation (17) is a nontrivial differential Equation. Section 4 presents one, initial attempt at a numerical solution that should not be mistaken for a definitive answer. Despite all this, Eq. (17) arguably constitutes a useful gain for Bayesian reinforcement learning: It replaces the intractable definition of the value in terms of future trajectories with a differential equation. This raises hope for new approaches to reinforcement learning, based on numerical analysis rather than sampling. Digression: Relaxing Some Assumptions This paper only applies to the specific problem class of Section 2. Any generalisations and extensions are future work, and I do not claim to solve them. But it is instructive to consider some easier extensions, and some harder ones: For example, it is intractable to simultaneously learn both g and f nonparametrically, if only the actual transitions are observed, because the beliefs over the two functions become infinitely dependent when conditioned on data. But if the belief on either g or f is parametric (e.g. a general linear model), a joint belief on g and f is tractable [see 15, §2.7], in fact straightforward. Both the quadratic control cost ∝ u Ru and the control-affine form (g(x, t)u) are relaxable assumptions – other parametric forms are possible, as long as they allow for analytic optimization of Eq. (14). On the question of learning the kernels for Gaussian process regression on q and f or g, it is clear that standard ways of inferring kernels [15, 18] can be used without complication, but that they are not covered by the notion of optimal learning as addressed here. 4 Numerically Solving the Hamilton-Jacobi-Bellman Equation Solving Equation (16) is principally a problem of numerical analysis, and a battery of numerical methods may be considered. This section reports on one specific Ansatz, a Galerkin-type projection analogous to the one used in [12]. For this we break with the generality of previous sections and assume that the kernels k are given by square exponentials k(a, b) = kSE (a, b; θ, S) = 1 θ2 exp(− 2 (a − b) S −1 (a − b)) with parameters θ, S. As discussed above, we approximate by setting Ω = 0. We find an approximate solution through a factorizing parametric Ansatz: Let the value of any point z ∈ H in the belief space be given through a set of parameters w and some nonlinear functionals φ, such that their contributions separate over phase space, mean, and covariance functions: v(z) = φe (ze ) we with φe , we ∈ RNe (19) e=x,Σq ,µq ,Σf ,µf 5 This projection is obviously restrictive, but it should be compared to the use of radial basis functions for function approximation, a similarly restrictive framework widely used in reinforcement learning. The functionals φ have to be chosen conducive to the form of Eq. (17). 
For square exponential kernels, one convenient choice is φa (zs ) = k(sz , sa ; θa , Sa ) s (20) [Σz (s∗ , s∗ ) − k(s∗ , s∗ )]k(s∗ , sb ; θb , Sb )k(s∗ , sb ; θb , Sb ) ds∗ ds∗ i j i j i j i j φb (zΣ ) = Σ and (21) K µz (s∗ )µz (s∗ )k(s∗ , sc , θc , Sc )k(s∗ , sc , θc , Sc ) ds∗ ds∗ i j i j i j φc (zµ ) = µ (22) K (the subtracted term in the first integral serves only numerical purposes). With this choice, the integrals of Equation (17) can be solved analytically (solutions left out due to space constraints). The approximate Ansatz turns Eq. (17) into an algebraic equation quadratic in wx , linear in all other we : 1 w Ψ(zx )wx − q(zx ) + 2 x Ξe (ze )we = 0 (23) e=x,µq ,Σq ,µf ,Σf using co-vectors Ξ and a matrix Ψ with elements Ξx (zs ) = γ −1 φa (zs ) − f (zx ) a s ΞΣ (zΣ ) = γ −1 φa (zΣ ) + λ a Σ a x φs (zs ) − a t φs (zs ) Lsτ (s∗ )Lsτ (s∗ ) i j K Ξµ (zµ ) a = γ −1 φa (zµ ) µ Ψ(z)k = [ k x φs (z)] − λ2 σsτ ¯2 2 ∂φΣ (zΣ ) ds∗ ds∗ ∂Σz (s∗ , s∗ ) i j i j Lsτ (s∗ )Lsτ (s∗ ) i j K g(zx )Rg(zx ) [ ∂ 2 φa (zµ ) µ ds∗ ds∗ ∂µz (s∗ )∂µz (s∗ ) i j i j (24) x φs (z)] Note that Ξµ and ΞΣ are both functions of the physical state, through sτ . It is through this functional dependency that the value of information is associated with the physical phase space-time. To solve for w, we simply choose a number of evaluation points z eval sufficient to constrain the resulting system of quadratic equations, and then find the least-squares solution wopt by function minimisation, using standard methods, such as Levenberg-Marquardt [19]. A disadvantage of this approach is that is has a number of degrees of freedom Θ, such as the kernel parameters, and the number and locations xa of the feature functionals. Our experiments (Section 5) suggest that it is nevertheless possible to get interesting results simply by choosing these parameters heuristically. 5 5.1 Experiments Illustrative Experiment on an Artificial Environment As a simple example system with a one-dimensional state space, f, q were sampled from the model described in Section 2, and g set to the unit function. The state space was tiled regularly, in a bounded region, with 231 square exponential (“radial”) basis functions (Equation 20), initially all with weight i wx = 0. For the information terms, only a single basis function was used for each term (i.e. one single φΣq , one single φµq , and equally for f , all with very large length scales S, covering the entire region of interest). As pointed out above, this does not imply a trivial structure for these terms, because of the functional dependency on Lsτ . Five times the number of parameters, i.e. Neval = 1175 evaluation points zeval were sampled, at each time step, uniformly over the same region. It is not intuitively clear whether each ze should have its own belief (i.e. whether the points must cover the belief space as well as the phase space), but anecdotal evidence from the experiments suggests that it suffices to use the current beliefs for all evaluation points. A more comprehensive evaluation of such aspects will be the subject of a future paper. The discount factor was set to γ = 50s, the sampling rate at λ = 2/s, the control cost at 10m2 /($s). Value and optimal control were evaluated at time steps of δt = 1/λ = 0.5s. Figure 1 shows the situation 50s after initialisation. The most noteworthy aspect is the nontrivial structure of exploration and exploitation terms. Despite the simplistic parameterisation of the corresponding functionals, their functional dependence on sτ induces a complex shape. 
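The fitting step described above (stack one instance of Eq. (23) per evaluation point, then find the least-squares weights with Levenberg-Marquardt) has a simple computational skeleton. The sketch below only mimics that structure: the arrays Psi and Xi and the values of q at the evaluation points are random placeholders, whereas in the actual method they come from the analytic integrals of the functionals in Eqs. (20)-(24).

```python
# Structural sketch of the final fitting step: one residual of the form of Eq. (23)
# per evaluation point, solved in the least-squares sense with Levenberg-Marquardt.
# Psi, Xi and q_eval are random stand-ins, not the paper's analytic quantities.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
n_x, n_b, n_eval = 20, 4, 120          # phase-space weights, belief-term weights, evaluation points

Psi = rng.standard_normal((n_eval, n_x, n_x))
Psi = 0.5 * (Psi + np.transpose(Psi, (0, 2, 1)))   # symmetric quadratic forms Psi(z_e)
Xi = rng.standard_normal((n_eval, n_x + n_b))      # stacked co-vectors [Xi_x, Xi_mu, Xi_Sigma, ...]
q_eval = rng.standard_normal(n_eval)               # q(z_x) at the evaluation points

def residuals(w):
    wx = w[:n_x]
    quad = 0.5 * np.einsum('i,eij,j->e', wx, Psi, wx)   # 1/2 wx' Psi(z_e) wx
    return quad - q_eval + Xi @ w                        # Eq. (23) residual at every z_e

sol = least_squares(residuals, np.zeros(n_x + n_b), method='lm')
print("residual norm:", np.linalg.norm(sol.fun))
```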
Figure 1: State after 50 time steps, plotted over phase space-time. top left: µq (blue is good). The belief over f is not shown, but has similar structure. top right: value estimate v at current belief: compare to next two panels to note that the approximation is relatively coarse. bottom left: exploration terms. bottom right: exploitation terms. At its current state (black diamond), the system is in the process of switching from exploitation to exploration (blue region in bottom right panel is roughly cancelled by red, forward cone in bottom left one). The system constantly balances exploration and exploitation, and the optimal balance depends nontrivially on location, time, and the actual value (as opposed to only uncertainty) of accumulated knowledge. This is an important insight that casts doubt on the usefulness of simple, local exploration boni, used in many reinforcement learning algorithms. Secondly, note that the system’s trajectory does not necessarily follow what would be the optimal path under full information. The value estimate reflects this, by assigning low (good) value to regions behind the system’s trajectory. This amounts to a sense of “remorse”: If the learner would have known about these regions earlier, it would have strived to reach them. But this is not a sign of sub-optimality: Remember that the value is defined on the augmented space. The plots in Figure 1 are merely a slice through that space at some level set in the belief space. 5.2 Comparative Experiment – The Furuta Pendulum The cart-and-pole system is an under-actuated problem widely studied in reinforcement learning. For variation, this experiment uses a cylindrical version, the pendulum on the rotating arm [20]. The task is to swing up the pendulum from the lower resting point. The table in Figure 2 compares the average loss of a controller with access to the true f, g, q, but otherwise using Algorithm 1, to that of an ε-greedy TD(λ) learner with linear function approximation, Simpkins et al.’s [12] Kalman method and the Gaussian process learning controller (Fig. 2). The linear function approximation of TD(λ) used the same radial basis functions as the three other methods. None of these methods is free of assumptions: Note that the sampling frequency influences TD in nontrivial ways rarely studied (for example through the coarseness of the ε-greedy policy). The parameters were set to γ = 5s, λ = 50/s. Note that reinforcement learning experiments often quote total accumulated loss, which differs from the discounted task posed to the learner. Figure 2 reports actual discounted losses. Figure 2: The Furuta pendulum system: A pendulum of length ℓ2 is attached to a rotatable arm of length ℓ1 (the figure labels the applied torque u and the joint angles θ1, θ2). The control input is the torque applied to the arm. Right: cost to go achieved by different methods. Lower is better. Error measures are one standard deviation over five experiments. Method / cumulative loss: Full Information (baseline) 4.4 ± 0.3; TD(λ) 6.401 ± 0.001; Kalman filter Optimal Learner 6.408 ± 0.001; Gaussian process optimal learner 4.6 ± 1.4. The GP method clearly outperforms the other two learners, which barely explore. Interestingly, none of the tested methods, not even the informed controller, achieve a stable controlled balance, although the GP learner does swing up the pendulum. 
This is due to the random, non-optimal location of basis functions, which means resolution is not necessarily available where it is needed (in regions of high curvature of the value function), and demonstrates a need for better solution methods for Eq. (17). There is of course a large number of other algorithms and methods one could compare to, and these results are anything but exhaustive. They should not be misunderstood as a critique of any other method. But they highlight the need for units of measure on every quantity, and show how hard optimal exploration and exploitation truly is. Note that, for time-varying or discounted problems, there is no “conservative” option that could be adopted in place of the Bayesian answer. 6 Conclusion Gaussian process priors provide a nontrivial class of reinforcement learning problems for which optimal reinforcement learning reduces to solving differential equations. Of course, this fact alone does not make the problem easier, as solving nonlinear differential equations is in general intractable. However, the ubiquity of differential descriptions in other fields raises hope that this insight opens new approaches to reinforcement learning. For intuition on how such solutions might work, one specific approximation was presented, using functionals to reduce the problem to finite least-squares parameter estimation. The critical reader will have noted how central the prior is for the arguments in Section 3: The dynamics of the learning process are predictions of future data, thus inherently determined exclusively by prior assumptions. One may find this unappealing, but there is no escape from it. Minimizing future loss requires predicting future loss, and predictions are always in danger of falling victim to incorrect assumptions. A finite initial identification phase may mitigate this problem by replacing prior with posterior uncertainty – but even then, predictions and decisions will depend on the model. The results of this paper raise new questions, theoretical and applied. The most pressing questions concern better solution methods for Eq. (14), in particular better means for taking the expectation over the uncertain dynamics to more than first order. There are also obvious probabilistic issues: Are there other classes of priors that allow similar treatments? (Note some conceptual similarities between this work and the BEETLE algorithm [4]). To what extent can approximate inference methods – widely studied in combination with Gaussian process regression – be used to broaden the utility of these results? Acknowledgments The author wishes to express his gratitude to Carl Rasmussen, Jan Peters, Zoubin Ghahramani, Peter Dayan, and an anonymous reviewer, whose thoughtful comments uncovered several errors and crucially improved this paper. References [1] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998. [2] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of two samples. Biometrika, 25:275–294, 1933. [3] M.O.G. Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, U of Massachusetts, Amherst, 2002. [4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704, 2006. [5] Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. 
In Uncertainty in Artificial Intelligence, pages 150–159, 1999. [6] Malcolm Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950, 2000. [7] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In International Conference on Machine Learning, pages 956–963, 2005. [8] J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Uncertainty in Artificial Intelligence, 2009. [9] J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning. Morgan Kaufmann, 2009. [10] E. Todorov. Linearly-solvable Markov decision problems. Advances in Neural Information Processing Systems, 19, 2007. [11] H. J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. In 9th Granada seminar on Computational Physics: Computational and Mathematical Modeling of Cooperative Behavior in Neural Systems., pages 149–181, 2007. [12] A. Simpkins, R. De Callafon, and E. Todorov. Optimal trade-off between exploration and exploitation. In American Control Conference, 2008, pages 33–38, 2008. [13] I. Fantoni and L. Rogelio. Non-linear Control for Underactuated Mechanical Systems. Springer, 1973. [14] A.A. Feldbaum. Dual control theory. Automation and Remote Control, 21(9):874–880, April 1961. [15] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [16] N. Wiener. Differential space. Journal of Mathematical Physics, 2:131–174, 1923. [17] T. Kailath. An innovations approach to least-squares estimation — part I: Linear filtering in additive white noise. IEEE Transactions on Automatic Control, 13(6):646–655, 1968. [18] I. Murray and R.P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. arXiv:1006.0868, 2010. [19] D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963. [20] K. Furuta, M. Yamakita, and S. Kobayashi. Swing-up control of inverted pendulum using pseudo-state feedback. Journal of Systems and Control Engineering, 206(6):263–269, 1992. 9

5 0.72473294 101 nips-2011-Gaussian process modulated renewal processes

Author: Yee W. Teh, Vinayak Rao

Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets. 1

6 0.68877226 24 nips-2011-Active learning of neural response functions with Gaussian processes

7 0.64424622 8 nips-2011-A Model for Temporal Dependencies in Event Streams

8 0.63777477 86 nips-2011-Empirical models of spiking in neural populations

9 0.6228348 2 nips-2011-A Brain-Machine Interface Operating with a Real-Time Spiking Neural Network Control Algorithm

10 0.60317397 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

11 0.58716357 219 nips-2011-Predicting response time and error rates in visual search

12 0.57799178 221 nips-2011-Priors over Recurrent Continuous Time Processes

13 0.57659125 75 nips-2011-Dynamical segmentation of single trials from population neural data

14 0.56890798 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations

15 0.55862772 302 nips-2011-Variational Learning for Recurrent Spiking Networks

16 0.54465801 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

17 0.53585202 44 nips-2011-Bayesian Spike-Triggered Covariance Analysis

18 0.53025955 85 nips-2011-Emergence of Multiplication in a Biophysical Model of a Wide-Field Visual Neuron for Computing Object Approaches: Dynamics, Peaks, & Fits

19 0.50168043 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes

20 0.49755186 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.03), (4, 0.044), (20, 0.029), (26, 0.02), (31, 0.13), (33, 0.019), (43, 0.081), (45, 0.093), (50, 0.138), (57, 0.064), (65, 0.028), (74, 0.058), (83, 0.084), (84, 0.038), (99, 0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87473696 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

Author: Alex K. Susemihl, Ron Meir, Manfred Opper

Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1

2 0.82881415 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

Author: Kamiar R. Rad, Liam Paninski

Abstract: Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We find that for large neural populations carrying a finite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufficient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we find that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched firing rates. We argue that our findings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design.

3 0.82016063 102 nips-2011-Generalised Coupled Tensor Factorisation

Author: Kenan Y. Yılmaz, Ali T. Cemgil, Umut Simsekli

Abstract: We derive algorithms for generalised tensor factorisation (GTF) by building upon the well-established theory of Generalised Linear Models. Our algorithms are general in the sense that we can compute arbitrary factorisations in a message passing framework, derived for a broad class of exponential family distributions including special cases such as Tweedie’s distributions corresponding to βdivergences. By bounding the step size of the Fisher Scoring iteration of the GLM, we obtain general updates for real data and multiplicative updates for non-negative data. The GTF framework is, then extended easily to address the problems when multiple observed tensors are factorised simultaneously. We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem. 1
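As a point of reference for the multiplicative updates mentioned in this abstract, the snippet below shows the best-known special case, KL-divergence non-negative matrix factorisation (β = 1, a single factorisation rather than a coupled tensor one). It illustrates the update style only, not the paper's GTF algorithm, and the data is random.

```python
# Hedged illustration of multiplicative updates in their simplest special case:
# KL-divergence NMF (beta = 1). Random non-negative data, single factorisation.
import numpy as np

rng = np.random.default_rng(4)
V = rng.random((30, 20))              # non-negative data
W = rng.random((30, 5)) + 0.1
H = rng.random((5, 20)) + 0.1

for _ in range(200):
    WH = W @ H
    W *= ((V / WH) @ H.T) / H.sum(axis=1)               # multiplicative update for W
    WH = W @ H
    H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]       # multiplicative update for H

print("KL-NMF reconstruction error:", np.linalg.norm(V - W @ H))
```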

4 0.82010597 75 nips-2011-Dynamical segmentation of single trials from population neural data

Author: Biljana Petreska, Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, Maneesh Sahani

Abstract: Simultaneous recordings of many neurons embedded within a recurrentlyconnected cortical network may provide concurrent views into the dynamical processes of that network, and thus its computational function. In principle, these dynamics might be identified by purely unsupervised, statistical means. Here, we show that a Hidden Switching Linear Dynamical Systems (HSLDS) model— in which multiple linear dynamical laws approximate a nonlinear and potentially non-stationary dynamical process—is able to distinguish different dynamical regimes within single-trial motor cortical activity associated with the preparation and initiation of hand movements. The regimes are identified without reference to behavioural or experimental epochs, but nonetheless transitions between them correlate strongly with external events whose timing may vary from trial to trial. The HSLDS model also performs better than recent comparable models in predicting the firing rate of an isolated neuron based on the firing rates of others, suggesting that it captures more of the “shared variance” of the data. Thus, the method is able to trace the dynamical processes underlying the coordinated evolution of network activity in a way that appears to reflect its computational role. 1

5 0.81807554 219 nips-2011-Predicting response time and error rates in visual search

Author: Bo Chen, Vidhya Navalpakkam, Pietro Perona

Abstract: A model of human visual search is proposed. It predicts both response time (RT) and error rates (ER) as a function of image parameters such as target contrast and clutter. The model is an ideal observer, in that it optimizes the Bayes ratio of target present vs target absent. The ratio is computed on the firing pattern of V1/V2 neurons, modeled by Poisson distributions. The optimal mechanism for integrating information over time is shown to be a ‘soft max’ of diffusions, computed over the visual field by ‘hypercolumns’ of neurons that share the same receptive field and have different response properties to image features. An approximation of the optimal Bayesian observer, based on integrating local decisions, rather than diffusions, is also derived; it is shown experimentally to produce very similar predictions to the optimal observer in common psychophysics conditions. A psychophysics experiment is proposed that may discriminate which mechanism is used in the human brain. Figure 1: Visual search. (A) Clutter and camouflage make visual search difficult. (B,C) Psychologists and neuroscientists build synthetic displays to study visual search. In (B) the target ‘pops out’ (∆θ = 45°), while in (C) the target requires more time to be detected (∆θ = 10°) [1]. 1
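The core quantity in this abstract, the Bayes ratio of target present versus target absent computed from Poisson firing, reduces for fixed rates to a sum of Poisson log-likelihood ratios. The toy snippet below illustrates that computation with invented rates and counts; it does not implement the paper's diffusion or soft-max machinery.

```python
# Hedged sketch: log likelihood ratio of "target present" vs "target absent"
# for Poisson spike counts. Rates and counts are invented for illustration.
import numpy as np
from scipy.stats import poisson

rates_present = np.array([8.0, 5.0, 3.0])   # expected counts per neuron if the target is present
rates_absent = np.array([4.0, 4.0, 4.0])    # expected counts if only distractors are present
counts = np.array([9, 6, 2])                # observed spike counts in the decision window

log_lr = np.sum(poisson.logpmf(counts, rates_present) - poisson.logpmf(counts, rates_absent))
print(f"log likelihood ratio (present vs absent): {log_lr:.3f}")
```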

6 0.81404251 249 nips-2011-Sequence learning with hidden units in spiking neural networks

7 0.81008565 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

8 0.80853349 133 nips-2011-Inferring spike-timing-dependent plasticity from spike train data

9 0.80801421 86 nips-2011-Empirical models of spiking in neural populations

10 0.80360216 301 nips-2011-Variational Gaussian Process Dynamical Systems

11 0.80019969 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

12 0.79883456 258 nips-2011-Sparse Bayesian Multi-Task Learning

13 0.7963267 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)

14 0.79375792 66 nips-2011-Crowdclustering

15 0.79344535 180 nips-2011-Multiple Instance Filtering

16 0.79306877 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

17 0.79248482 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

18 0.79197526 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories

19 0.79098976 24 nips-2011-Active learning of neural response functions with Gaussian processes

20 0.79016173 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data