nips nips2011 nips2011-8 knowledge-graph by maker-knowledge-mining

8 nips-2011-A Model for Temporal Dependencies in Event Streams


Source: pdf

Author: Asela Gunawardana, Christopher Meek, Puyang Xu

Abstract: We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. [sent-5, score-0.422]

2 We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. [sent-6, score-0.547]

3 We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events. [sent-7, score-0.872]

4 1 Introduction The problem of modeling temporal dependencies in temporal streams of discrete events arises in a wide variety of applications. [sent-8, score-0.504]

5 For example, system error logs [14], web search query logs, the firing patterns of neurons [18] and gene expression data [8], can all be viewed as streams of events over time. [sent-9, score-0.441]

6 Events are distinguished by their types (e.g., the web query issued or the type of error logged), and the dependencies between events can be due to both their timing and their types. [sent-12, score-0.36]

7 Modeling these dependencies is valuable for forecasting future events in applications such as system failure prediction for preemptive maintenance or forecasting web users’ future interests for targeted advertising. [sent-13, score-0.829]

8 This model captures the dependencies of each type of event on events in the past through a set of piecewise-constant conditional intensity functions. [sent-15, score-0.736]

9 We use decision trees to represent these dependencies and give a conjugate prior for this model, allowing for closed-form computation of the marginal likelihood and parameter posteriors. [sent-16, score-0.349]

10 Forecasting can be carried out using forward sampling for arbitrary finite duration queries. [sent-19, score-0.183]

11 For episodic sequence queries, that is, queries that specify particular sequences of events in given future time intervals, we develop a novel approach for estimating the probability of rare queries, which we call the Poisson Superposition Importance Sampler (PSIS). [sent-20, score-0.516]

12 Using synthetic data we show that PCIMs can correctly learn the underlying dependency structure of event streams, and that the PSIS leads to effective forecasting. [sent-22, score-0.243]

13 We then use real supercomputer event log data to show that PCIMs can be learned more than an order of magnitude faster than Poisson Networks [15, 18], and that they have better test set likelihood. [sent-23, score-0.357]

14 Finally, we show that PCIMs and the PSIS are useful in forecasting future interests of real web search users. [sent-24, score-0.386]

15 2 Related Work While graphical models such as Bayesian networks [2] and dependency networks [10] are widely used to model the dependencies between variables, they do not model temporal dependencies (see e. [sent-25, score-0.389]

16 Dynamic Bayesian Networks (DBN) [5, 9] allow modeling of temporal dependencies in discrete time. [sent-28, score-0.212]

17 In addition, allowing long term dependencies requires conditioning on multiple steps into the past, and choosing too fast a sampling rate increases the number of such steps that need to be conditioned on. [sent-31, score-0.183]

18 CTBNs are homogeneous Markov models of the joint trajectories of discrete finite variables, rather than models of event streams in continuous time [15]. [sent-33, score-0.38]

19 In contrast, CT-NOR and Poisson Cascades model event streams, but require the modeler to choose a parametric form for temporal dependencies. [sent-34, score-0.308]

20 PCIMs, in contrast to CT-NOR and Poisson Cascades, perform structure learning to learn how different events in the past affect future events. [sent-37, score-0.187]

21 An event is then composed of a non-negative time-stamp t and a label l. [sent-41, score-0.294]

22 An event sequence is x = {(ti, li)}_{i=1}^{n}, where 0 < t1 < · · · < tn. [sent-42, score-0.353]

23 The history at time t of event sequence x is the sub-sequence h(t, x) = {(ti , li ) | (ti , li ) ∈ x, ti ≤ t}. [sent-43, score-0.638]

24 We define the ending time t(x) of an event sequence x as the time of the last event in x: t(x) = max ({t : (t, l) ∈ x}) so that t(hi ) = ti−1 . [sent-46, score-0.53]
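To make the definitions above concrete, here is a minimal Python sketch of the history h(t, x) and the ending time t(x) of an event sequence. The function names and the −∞ convention for empty sequences are ours, not the paper's:

```python
def history(t, x):
    """h(t, x): the sub-sequence of events (ti, li) in x with ti <= t."""
    return [(ti, li) for (ti, li) in x if ti <= t]

def ending_time(x):
    """t(x): the time of the last event in x; -inf by convention if x is empty."""
    return max((ti for (ti, li) in x), default=float("-inf"))
```

For example, with x = [(0.5, 'A'), (1.2, 'B'), (3.0, 'A')], history(1.2, x) keeps the first two events and ending_time(x) is 3.0.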

25 A Conditional Intensity Model (CIM) is a set of non-negative conditional intensity functions indexed by label {λl (t|x; θ)}l∈L . [sent-47, score-0.299]

26 The data likelihood for this model is p(x|θ) = ∏_{l∈L} ∏_{i=1}^{n} λl(ti|hi; θ)^{1l(li)} e^{−Λl(ti|hi; θ)} (1), where Λl(t|x; θ) = ∫_{−∞}^{t} λl(τ|x; θ) dτ for each event sequence x and the indicator function 1l(l′) is one if l′ = l and zero otherwise. [sent-48, score-0.335]

27 The conditional intensities are assumed to satisfy λl (t|x; θ) = 0 for t ≤ t(x) to ensure that ti > ti−1 = t(hi ). [sent-49, score-0.246]

28 Despite the fact that the modeling assumptions are weak, these models offer a powerful approach for decomposing the dependencies of different event types on the past. [sent-53, score-0.39]

29 1 Piecewise-Constant Conditional Intensity Models Piecewise-Constant Conditional Intensity Models (PCIMs) are Conditional Intensity Models where the conditional intensity functions are assumed to be piecewise-constant. [sent-56, score-0.248]

30 PCIMs are defined in terms of local structures Sl for each label l, which specify regions in time where the corresponding conditional intensity function is constant, and local parameters θl for each label which specify the values taken in those regions. [sent-58, score-0.377]

31 The conditional intensity functions are defined as λl (t|x) = λls with s = σl (t, x), and thus are piecewise constant. [sent-60, score-0.28]

32 The resulting data likelihood can be written as p(x|S, θ) = ∏_{l∈L} ∏_{s∈Σl} λls^{cls(x)} e^{−λls dls(x)} (2), where S = {Sl}_{l∈L}, θ = {θl}_{l∈L}, and cls(x) is the number of times label l occurs in x when the state function for l maps to state s (i.e., Σ_i 1l(li) 1s(σl(ti, hi))). [sent-61, score-0.845]

33 Correspondingly, dls(x) is the total duration, up to t(x), during which the state function for l maps to state s in the data x. [sent-63, score-0.308]
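Once the sufficient statistics cls(x) and dls(x) have been extracted from the data, the likelihood in Equation (2) factorizes over labels and states. A minimal sketch in log form, assuming the counts c and durations d are precomputed nested dicts (a data layout of our choosing, not the paper's):

```python
import math

def pcim_log_likelihood(lam, c, d):
    """log p(x|S, theta) = sum over labels l and states s of
    c_ls * log(lam_ls) - lam_ls * d_ls, i.e. Equation (2) in log form."""
    ll = 0.0
    for l in lam:
        for s in lam[l]:
            ll += c[l][s] * math.log(lam[l][s]) - lam[l][s] * d[l][s]
    return ll
```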

34 The conditional intensity for l is given by the regression λl(t|x, θ) = e^{wl·σl(t,x)} with parameter wl. [sent-70, score-0.223]

35 The per-state marginal likelihood has the closed form (βls^{αls} / Γ(αls)) · Γ(αls + cls(x)) / (βls + dls(x))^{αls + cls(x)}, and the posterior mean rate is (αls + cls(x)) / (βls + dls(x)), which is E[λls | x]. [sent-85, score-0.18]
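These closed forms are the standard Gamma conjugacy for a rate observed with count c and exposure duration d. A short sketch (the function names are ours):

```python
import math

def log_marginal_likelihood(alpha, beta, c, d):
    """log of  beta^alpha / Gamma(alpha) * Gamma(alpha + c) / (beta + d)^(alpha + c)."""
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + math.lgamma(alpha + c) - (alpha + c) * math.log(beta + d))

def posterior_mean_rate(alpha, beta, c, d):
    """E[lambda_ls | x] = (alpha + c_ls(x)) / (beta + d_ls(x))."""
    return (alpha + c) / (beta + d)
```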

36 Structure Learning with Decision Trees In this section, we specify the set of possible structures in terms of a set of basis state functions, a set of decision trees built from them, and a greedy Bayesian model selection procedure for learning a structure. [sent-86, score-0.22]

37 We use B to denote the set of basis state functions f (t, x), each taking values in a basis state set Σf . [sent-88, score-0.253]

38 The state function σl (t, x) is computed by recursively applying the basis state functions in the tree until a leaf is reached. [sent-91, score-0.19]

39 We begin with Sl being the trivial decision tree that maps all event sequences and times to the root. [sent-96, score-0.379]
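The greedy selection loop described above can be sketched abstractly. In the paper, candidate refinements split a leaf of the current tree and are scored by the closed-form marginal likelihood; here `score` and `candidate_splits` are hypothetical stand-ins for those pieces:

```python
def greedy_grow(score, initial_tree, candidate_splits):
    """Repeatedly move to the best-scoring refinement until none improves."""
    tree = initial_tree
    while True:
        candidates = candidate_splits(tree)
        best = max(candidates, key=score, default=None)
        if best is None or score(best) <= score(tree):
            return tree
        tree = best
```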

40 In our experiments, we wish to learn how events depend on the timing and type of prior events. [sent-103, score-0.2]

41 In particular, we use binary basis state functions f_{l′,d1,d2,τ} indexed by a label l′ ∈ L, two time offsets 0 ≤ d1 < d2, and a threshold τ > 0. [sent-105, score-0.19]

42 Such an f encodes whether or not the event sequence x contains at least τ events with label l′ with timestamps in the window [t − d2, t − d1). [sent-106, score-0.503]
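The window test described above is straightforward to implement. A minimal sketch, with x assumed to be a list of (time, label) pairs (helper name is ours):

```python
def basis_state(t, x, label, d1, d2, tau):
    """f_{l',d1,d2,tau}(t, x): true iff x contains at least tau events with
    the given label whose timestamps fall in the window [t - d2, t - d1)."""
    count = sum(1 for (ti, li) in x if li == label and t - d2 <= ti < t - d1)
    return count >= tau
```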

43 Examples of decision trees that use such basis state functions are shown in Figure 1. [sent-107, score-0.219]

44 5 Forecasting In this section, we describe how to use PCIMs to forecast whether a sequence of target labels will occur in a given order and in given time intervals. [sent-108, score-0.179]

45 We call such a sequence and set of associated intervals an episodic sequence, and denote it by e = {(lj^*, [aj^*, bj^*))}_{j=1}^{k}. [sent-110, score-0.307]

46 We say that the episodic sequence e occurs in an event sequence x if ∃ i1 < · · · < ik : (t_{ij}, l_{ij}) ∈ x, l_{ij} = lj^*, t_{ij} ∈ [aj^*, bj^*). [sent-112, score-0.687]
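The occurrence condition above can be checked with a greedy left-to-right scan: matching episode j to the earliest feasible event never rules out a match for episode j+1, since later episodes only require strictly later indices. A sketch (helper name is ours):

```python
def episodic_sequence_occurs(e, x):
    """e: list of (label, (a, b)) episodes in order; x: time-sorted (t, label) events."""
    i = 0
    for (label, (a, b)) in e:
        # advance to the earliest unused event matching this episode
        while i < len(x) and not (x[i][1] == label and a <= x[i][0] < b):
            i += 1
        if i == len(x):
            return False
        i += 1  # episode matched; later episodes must use strictly later events
    return True
```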

47 The set of event sequences x in which e occurs is denoted Xe . [sent-113, score-0.372]

48 Given an event sequence h and a time t^* ≥ t(h), we term any event sequence x whose history up to t^* agrees with h (i.e., h(t^*, x) = h) an extension of h from t^*. [sent-114, score-0.605]

49 Our forecasting problem is, given an observed sequence h at time t^* ≥ t(h), to compute the probability that e occurs in extensions of h from t^*. [sent-117, score-0.321]

50 We therefore give Monte Carlo estimates for p(Xe |h, t∗ ), first describing a forward sampling procedure for forecasting episodic sequences (also applicable to other forecasting problems), and then introducing an importance sampling scheme specifically designed for forecasting episodic sequences. [sent-120, score-1.263]

1 Forward Sampling The probability of an episodic sequence can be estimated using a forward sampling approach, by sampling M extensions {x^{(m)}}_{m=1}^{M} of h from t^* and using the estimate p̂Fwd(Xe | h, t^*; M) = (1/M) Σ_{m=1}^{M} 1_{Xe}(x^{(m)}). [sent-122, score-0.43]

52 Forward sampling consists of iteratively obtaining a sample sequence xi of length i by sampling (ti, li) and appending it to a previously sampled sequence x_{i−1} of length i − 1. [sent-128, score-0.329]

53 The CIM likelihood (Equation 1) of an arbitrary event sequence x can be written as ∏_{i=1}^{n} p(ti, li | hi; θ). [sent-129, score-0.401]

54 The finite extension up to b∗ is obtained by terminating when ti > b∗ and rejecting ti . [sent-132, score-0.376]

55 A more detailed description of sampling the candidate times ti^l from piecewise-constant conditional intensities is given in [15]. [sent-134, score-0.196]
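Between changes of the decision-tree states, a PCIM's intensities are constant, so one step of forward sampling reduces to competing exponential clocks (Poisson superposition): each label proposes an exponential arrival time at its current rate, and the earliest proposal wins. The sketch below makes the simplifying assumption that no state boundary is crossed before the next event; the full sampler in [15] also re-draws at boundary crossings:

```python
import random

def sample_next_event(t_prev, rates, rng=random):
    """rates: dict label -> current constant intensity lambda_l.
    Returns the next event (ti, li) after t_prev under the competing-risks view."""
    best_t, best_l = float("inf"), None
    for label, lam in rates.items():
        if lam > 0:
            t = t_prev + rng.expovariate(lam)  # exponential waiting time at rate lam
            if t < best_t:
                best_t, best_l = t, label
    return best_t, best_l
```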

56 2 Importance Sampling When using a forward sampling approach to forecast unlikely episodic sequences, the episodes of interest will not occur in most of the sampled extensions and our estimate of p(Xe |h, t∗ ) will be noisy. [sent-137, score-0.419]

57 In fact, because the absolute error in p̂Fwd falls as the inverse square root of the number of sequences sampled, we would need O(1/p(Xe|h, t^*)^2) sample sequences to get non-trivial lower bounds on p(Xe|h, t^*) using a forward sampling approach. [sent-138, score-0.339]

58 To mitigate this problem we develop an importance sampling approach, where sequences are drawn from a proposal distribution q(·) that has an increased likelihood of generating extensions in which Xe occurs, and a weighted empirical estimate is then used. [sent-139, score-0.356]
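Given extensions drawn from the proposal, each scored by its importance weight w(x^{(m)}) = p(x^{(m)})/q(x^{(m)}), the weighted empirical estimate is a one-liner. A sketch, with a sample representation of our own choosing:

```python
def importance_estimate(samples):
    """p_hat(Xe | h, t*): the average of w(x) * 1_Xe(x) over M proposal draws.
    samples: list of (weight, occurred) pairs, one per draw."""
    return sum(w for (w, occurred) in samples if occurred) / len(samples)
```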

59 The active episode j(x) is 0 if t(x) ≥ bj(x) for all j = 1, · · · , k, and is min({j : bj(x) > t(x)}) otherwise. [sent-142, score-0.213]

60 The time bj(x) when the jth episode ceases to be active is the time at which the jth episode occurs in x, or bj^* if it does not occur. [sent-143, score-0.209]

61 If the episodic intervals [aj^*, bj^*) do not overlap, aj(x) = aj^*. [sent-144, score-0.211]

62 In general, aj(x) and bj(x) are given by the recursion aj(x) = max{aj^*, b_{j−1}(x)} and bj(x) = min({bj^*} ∪ {ti : (ti, li) ∈ x, li = lj^*, ti ∈ [aj(x), bj^*)}). [sent-145, score-0.744]

63 As the proposal distribution is also a CIM, importance sampling can be done using the forward sampling procedure above. [sent-147, score-0.343]

64 In practice the computation of j(x), aj(x), and bj(x) can be done during forward sampling. [sent-149, score-0.255]

65 The importance weight corresponding to our proposal distribution is w(x) = ∏_{j=1}^{k} exp((bj(x) − aj(x)) / (bj^* − aj(x))) ∏_{(ti,li)∈x : ti=bj(x), li=lj^*} λ_{lj^*}(ti|xi) / (λ_{lj^*}(ti|xi) + 1/(bj^* − aj(x))). [sent-150, score-0.554]

66 6 Experimental Results We first validate that PCIMs can learn temporal dependencies and that the PSIS gives faster forecasting than forward sampling using a synthetic data set. [sent-155, score-0.544]

67 Finally we show that PCIMs and the PSIS allow forecasting the future interests of web search users using real log data from a major commercial search engine. [sent-157, score-0.467]

68 1 Validation on Synthetic Data In order to evaluate the ability of PCIMs to learn nonlinear temporal dependencies we sampled data from a known model and verified that the dependencies learned were correct. [sent-159, score-0.293]

[Figure 1: an example PCIM given as decision trees, with tests such as "C in [t-1,t)" and "B in [t-2,t-1)" at internal nodes and constant rates λ at the leaves.]

70 We used basis state functions that tested for the presence of each label in windows with boundaries at t − 0, 1, 2, · · · , 10, and +∞ time units. [sent-176, score-0.19]

71 We evaluated the PSIS in forecasting event sequences with the model shown in Figure 1. [sent-184, score-0.542]

72 The convergence of importance sampling is compared with that of forward sampling in Figure 2. [sent-185, score-0.29]

73 We give results for forecasting three different episodic sequences, consisting of the label sequences {C}, {C, B}, and {C, B, A}, all in the interval [0, 1], given an empty history. [sent-186, score-0.479]

74 We show how estimates of the probabilities of given episodic sequences vary as a function of the number of sequences sampled, giving the mean and variance of the trajectories of the estimates computed over ten runs. [sent-188, score-0.311]

75 Since exact inference is infeasible for this model, we forward sample 4,000,000 event sequences and display this estimate. [sent-190, score-0.422]

76 The solid line shows p̂Fwd based on 4 million event sequences. [sent-194, score-0.243]

77 This further suggests the need for importance sampling for rare label sequences. [sent-197, score-0.184]

78 2 Modeling Supercomputer Event Logs We compared PCIM and Poisson Nets on the task of modeling system event logs from the BlueGene/L supercomputer at Lawrence Livermore National Laboratory [14], available at the USENIX Computer Failure Data Repository. [sent-199, score-0.489]

79 We filtered out informational (non-alert) messages from the logs, and randomly split the events by node into a training set with 311,060 alerts from 21,962 nodes, and a test set with 68,502 alerts from 9,412 nodes. [sent-200, score-0.188]

80 Both models used priors with a mean rate of an event every 100 days, no dependencies, and an equivalent sample size of one second. [sent-205, score-0.243]

81 Table 1: A comparison of the PCIM and Poisson Net in modeling supercomputer event logs (training time: 11 min for the PCIM vs. 3 hr 33 min for the Poisson Net). [sent-212, score-0.39]

82 3 Forecasting Future Interests of Web Search Users We used the query logs of a major internet search engine to investigate the use of PCIMs in forecasting the future interests of web search users. [sent-216, score-0.547]

83 Our training set contains event sequences for approximately 23k users consisting of about 385k timestamped labels recorded over a two-month period. [sent-219, score-0.453]

84 The test set contains event sequences for approximately 11k users, consisting of about 160k timestamped labels recorded over the next month. [sent-220, score-0.453]

85 We trained a PCIM on the training data using window boundaries at t − 1 hour, t − 1 day, and t − 1 week, and basis state functions that tested for the presence of one or more instances of each label in each window, treating users as i.i.d. [sent-221, score-0.239]

86 The prior had a mean rate of an event every year and an equivalent sample size of one day. [sent-224, score-0.28]

Figure 3: Precision-recall curves for forecasting future Health & Wellness queries using a full PCIM, a restricted PCIM that conditions only on past Health & Wellness queries, a baseline that takes into account only past Health & Wellness queries and not their timing, and random guessing. [sent-229, score-0.453]

88 Given the first week of each test user’s event sequence, we forecasted whether they would issue a query in a chosen target category in the second week. [sent-230, score-0.348]

89 Also shown is the result for restricted PCIMs that only model dependencies on prior occurrences of the target category. [sent-233, score-0.207]

90 This is compared to a baseline where the conditional intensity depends only on whether the target label appeared in the history. [sent-234, score-0.3]

91 This shows that modeling the temporal aspect of dependencies does provide a large improvement. [sent-235, score-0.212]

92 Modeling dependencies on past occurrences of other labels also provides an improvement in the right-hand region of the precision-recall curve. [sent-236, score-0.215]

93 As Figure 3 suggests (but doesn’t show), the PCIM can model cross-label dependencies to forecast the first occurrence of the target label. [sent-238, score-0.212]

94 7 Discussion We presented the Piecewise-Constant Conditional Intensity Model, which is a model of temporal dependencies in continuous time event streams. [sent-241, score-0.454]

95 For example, basis state functions could encode the most recent events that have occurred in the history rather than the events that occurred in windows of interest. [sent-249, score-0.414]

96 We also presented the Poisson Superposition Importance Sampler for forecasting episodic sequences with PCIMs. [sent-252, score-0.428]

97 Developing forecasting algorithms for more general queries is of interest. [sent-253, score-0.281]

98 Finally, we demonstrated the value of PCIMs in modeling the temporal behavior of web search users and of supercomputer nodes. [sent-254, score-0.346]

99 In many applications, we have access to richer event streams such as spatio-temporal event streams and event streams with structured labels. [sent-255, score-1.044]

100 It would be interesting to extend PCIMs to handle such rich event streams. [sent-256, score-0.243]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ls', 0.426), ('pcims', 0.375), ('pcim', 0.261), ('event', 0.243), ('forecasting', 0.208), ('xe', 0.198), ('ti', 0.188), ('poisson', 0.17), ('intensity', 0.165), ('psis', 0.13), ('sl', 0.129), ('episodic', 0.129), ('events', 0.122), ('supercomputer', 0.114), ('dependencies', 0.114), ('streams', 0.105), ('logs', 0.099), ('hi', 0.098), ('cls', 0.098), ('pfwd', 0.098), ('yes', 0.094), ('sequences', 0.091), ('lj', 0.09), ('forward', 0.088), ('bj', 0.085), ('aj', 0.082), ('dls', 0.082), ('queries', 0.073), ('forecast', 0.072), ('sampling', 0.069), ('li', 0.066), ('cascades', 0.066), ('cim', 0.065), ('temporal', 0.065), ('importance', 0.064), ('basis', 0.063), ('interests', 0.062), ('conditional', 0.058), ('web', 0.053), ('proposal', 0.053), ('label', 0.051), ('state', 0.051), ('week', 0.049), ('users', 0.049), ('meek', 0.049), ('pimp', 0.049), ('simma', 0.049), ('uri', 0.049), ('wellness', 0.049), ('networks', 0.048), ('likelihood', 0.048), ('decision', 0.045), ('sequence', 0.044), ('episode', 0.043), ('timestamps', 0.043), ('timing', 0.041), ('uai', 0.041), ('superposition', 0.04), ('marginal', 0.039), ('occurs', 0.038), ('labels', 0.037), ('risks', 0.037), ('tl', 0.037), ('health', 0.037), ('prior', 0.037), ('trees', 0.035), ('past', 0.034), ('bayesian', 0.034), ('modeling', 0.033), ('aleksandr', 0.033), ('alerts', 0.033), ('ctbns', 0.033), ('lij', 0.033), ('nodelman', 0.033), ('tij', 0.033), ('timestamped', 0.033), ('search', 0.032), ('piecewise', 0.032), ('continuous', 0.032), ('conjugate', 0.031), ('extensions', 0.031), ('history', 0.031), ('future', 0.031), ('query', 0.03), ('christian', 0.03), ('episodes', 0.03), ('occurrences', 0.03), ('lsm', 0.029), ('redmond', 0.029), ('shelton', 0.029), ('sampler', 0.028), ('microsoft', 0.027), ('chickering', 0.026), ('heckerman', 0.026), ('maxwell', 0.026), ('target', 0.026), ('specify', 0.026), ('duration', 0.026), ('net', 0.025), ('functions', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 8 nips-2011-A Model for Temporal Dependencies in Event Streams

Author: Asela Gunawardana, Christopher Meek, Puyang Xu

Abstract: We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events. 1

2 0.12113745 101 nips-2011-Gaussian process modulated renewal processes

Author: Yee W. Teh, Vinayak Rao

Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets. 1

3 0.10458764 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

Author: Rina Foygel, Ohad Shamir, Nati Srebro, Ruslan Salakhutdinov

Abstract: We provide rigorous guarantees on learning with the weighted trace-norm under arbitrary sampling distributions. We show that the standard weighted-trace norm might fail when the sampling distribution is not a product distribution (i.e. when row and column indexes are not selected independently), present a corrected variant for which we establish strong learning guarantees, and demonstrate that it works better in practice. We provide guarantees when weighting by either the true or empirical sampling distribution, and suggest that even if the true distribution is known (or is uniform), weighting by the empirical distribution may be beneficial. 1

4 0.099093325 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

Author: Philipp Hennig

Abstract: The exploration-exploitation trade-off is among the central challenges of reinforcement learning. The optimal Bayesian solution is intractable in general. This paper studies to what extent analytic statements about optimal learning are possible if all beliefs are Gaussian processes. A first order approximation of learning of both loss and dynamics, for nonlinear, time-varying systems in continuous time and space, subject to a relatively weak restriction on the dynamics, is described by an infinite-dimensional partial differential equation. An approximate finitedimensional projection gives an impression for how this result may be helpful. 1 Introduction – Optimal Reinforcement Learning Reinforcement learning is about doing two things at once: Optimizing a function while learning about it. These two objectives must be balanced: Ignorance precludes efficient optimization; time spent hunting after irrelevant knowledge incurs unnecessary loss. This dilemma is famously known as the exploration exploitation trade-off. Classic reinforcement learning often considers time cheap; the trade-off then plays a subordinate role to the desire for learning a “correct” model or policy. Many classic reinforcement learning algorithms thus rely on ad-hoc methods to control exploration, such as “ -greedy” [1], or “Thompson sampling” [2]. However, at least since a thesis by Duff [3] it has been known that Bayesian inference allows optimal balance between exploration and exploitation. It requires integration over every possible future trajectory under the current belief about the system’s dynamics, all possible new data acquired along those trajectories, and their effect on decisions taken along the way. This amounts to optimization and integration over a tree, of exponential cost in the size of the state space [4]. The situation is particularly dire for continuous space-times, where both depth and branching factor of the “tree” are uncountably infinite. 
Several authors have proposed approximating this lookahead through samples [5, 6, 7, 8], or ad-hoc estimators that can be shown to be in some sense close to the Bayes-optimal policy [9]. In a parallel development, recent work by Todorov [10], Kappen [11] and others introduced an idea to reinforcement learning long commonplace in other areas of machine learning: Structural assumptions, while restrictive, can greatly simplify inference problems. In particular, a recent paper by Simpkins et al. [12] showed that it is actually possible to solve the exploration exploitation trade-off locally, by constructing a linear approximation using a Kalman filter. Simpkins and colleagues further assumed to know the loss function, and the dynamics up to Brownian drift. Here, I use their work as inspiration for a study of general optimal reinforcement learning of dynamics and loss functions of an unknown, nonlinear, time-varying system (note that most reinforcement learning algorithms are restricted to time-invariant systems). The core assumption is that all uncertain variables are known up to Gaussian process uncertainty. The main result is a first-order description of optimal reinforcement learning in form of infinite-dimensional differential statements. This kind of description opens up new approaches to reinforcement learning. As an only initial example of such treatments, Section 4 1 presents an approximate Ansatz that affords an explicit reinforcement learning algorithm; tested in some simple but instructive experiments (Section 5). An intuitive description of the paper’s results is this: From prior and corresponding choice of learning machinery (Section 2), we construct statements about the dynamics of the learning process (Section 3). The learning machine itself provides a probabilistic description of the dynamics of the physical system. Combining both dynamics yields a joint system, which we aim to control optimally. 
Doing so amounts to simultaneously controlling exploration (controlling the learning system) and exploitation (controlling the physical system). Because large parts of the analysis rely on concepts from optimal control theory, this paper will use notation from that field. Readers more familiar with the reinforcement learning literature may wish to mentally replace coordinates x with states s, controls u with actions a, dynamics with transitions p(s | s, a) and utilities q with losses (negative rewards) −r. The latter is potentially confusing, so note that optimal control in this paper will attempt to minimize values, rather than to maximize them, as usual in reinforcement learning (these two descriptions are, of course, equivalent). 2 A Class of Learning Problems We consider the task of optimally controlling an uncertain system whose states s ≡ (x, t) ∈ K ≡ RD ×R lie in a D +1 dimensional Euclidean phase space-time: A cost Q (cumulated loss) is acquired at (x, t) with rate dQ/dt = q(x, t), and the first inference problem is to learn this analytic function q. A second, independent learning problem concerns the dynamics of the system. We assume the dynamics separate into free and controlled terms affine to the control: dx(t) = [f (x, t) + g(x, t)u(x, t)] dt (1) where u(x, t) is the control function we seek to optimize, and f, g are analytic functions. To simplify our analysis, we will assume that either f or g are known, while the other may be uncertain (or, alternatively, that it is possible to obtain independent samples from both functions). See Section 3 for a note on how this assumption may be relaxed. W.l.o.g., let f be uncertain and g known. Information about both q(x, t) and f (x, t) = [f1 , . . . , fD ] is acquired stochastically: A Poisson process of constant rate λ produces mutually independent samples yq (x, t) = q(x, t)+ q and yf d (x, t) = fd (x, t)+ fd where q 2 ∼ N (0, σq ); fd 2 ∼ N (0, σf d ). (2) The noise levels σq and σf are presumed known. 
Let our initial beliefs about q and f be given by D Gaussian processes GP kq (q; µq , Σq ); and independent Gaussian processes d GP kf d (fd ; µf d , Σf d ), respectively, with kernels kr , kf 1 , . . . , kf D over K, and mean / covariance functions µ / Σ. In other words, samples over the belief can be drawn using an infinite vector Ω of i.i.d. Gaussian variables, as 1/2 ˜ fd ([x, t]) = µf d ([x, t])+ 1/2 Σf d ([x, t], [x , t ])Ω(x , t )dx dt = µf d ([x, t])+(Σf d Ω)([x, t]) (3) the second equation demonstrates a compact notation for inner products that will be used throughout. It is important to note that f, q are unknown, but deterministic. At any point during learning, we can use the same samples Ω to describe uncertainty, while µ, Σ change during the learning process. To ensure continuous trajectories, we also need to regularize the control. Following control custom, 1 we introduce a quadratic control cost ρ(u) = 2 u R−1 u with control cost scaling matrix R. Its units [R] = [x/t]/[Q/x] relate the cost of changing location to the utility gained by doing so. The overall task is to find the optimal discounted horizon value ∞ v(x, t) = min u t 1 e−(τ −t)/γ q[χ[τ, u(χ, τ )], τ ] + u(χ, τ ) R−1 u(χ, τ ) dτ 2 (4) where χ(τ, u) is the trajectory generated by the dynamics defined in Equation (1), using the control law (policy) u(x, t). The exponential definition of the discount γ > 0 gives the unit of time to γ. Before beginning the analysis, consider the relative generality of this definition: We allow for a continuous phase space. Both loss and dynamics may be uncertain, of rather general nonlinear form, and may change over time. The specific choice of a Poisson process for the generation of samples is 2 somewhat ad-hoc, but some measure is required to quantify the flow of information through time. The Poisson process is in some sense the simplest such measure, assigning uniform probability density. 
An alternative is to assume that datapoints are acquired at regular intervals of width λ. This results in a quite similar model but, since the system’s dynamics still proceed in continuous time, can complicate notation. A downside is that we had to restrict the form of the dynamics. However, Eq. (1) still covers numerous physical systems studied in control, for example many mechanical systems, from classics like cart-and-pole to realistic models for helicopters [13]. 3 Optimal Control for the Learning Process The optimal solution to the exploration exploitation trade-off is formed by the dual control [14] of a joint representation of the physical system and the beliefs over it. In reinforcement learning, this idea is known as a belief-augmented POMDP [3, 4], but is not usually construed as a control problem. This section constructs the Hamilton-Jacobi-Bellman (HJB) equation of the joint control problem for the system described in Sec. 2, and analytically solves the equation for the optimal control. This necessitates a description of the learning algorithm’s dynamics: At time t = τ , let the system be at phase space-time sτ = (x(τ ), τ ) and have the Gaussian process belief GP(q; µτ (s), Στ (s, s )) over the function q (all derivations in this section will focus on q, and we will drop the sub-script q from many quantities for readability. The forms for f , or g, are entirely analogous, with independent Gaussian processes for each dimension d = 1, . . . , D). This belief stems from a finite number N of samples y 0 = [y1 , . . . , yN ] ∈ RN collected at space-times S 0 = [(x1 , t1 ), . . . , (xN , tN )] ≡ [s1 , . . . , sN ] ∈ KN (note that t1 to tN need not be equally spaced, ordered, or < τ ). 
For arbitrary points s* = (x*, t*) ∈ K, the belief over q(s*) is Gaussian with mean function μ_τ and covariance function Σ_τ [15]:

\mu_\tau(s^*) = k(s^*, S_0)\,[K(S_0, S_0) + \sigma_q^2 I]^{-1}\, y_0
\Sigma_\tau(s_i^*, s_j^*) = k(s_i^*, s_j^*) - k(s_i^*, S_0)\,[K(S_0, S_0) + \sigma_q^2 I]^{-1}\, k(S_0, s_j^*)   (5)

where K(S_0, S_0) is the Gram matrix with elements K_{ab} = k(s_a, s_b). We will abbreviate K_0 ≡ [K(S_0, S_0) + σ_q² I] from here on. The co-vector k(s*, S_0) has elements k_i = k(s*, s_i) and will be shortened to k_0. How does this belief change as time moves from τ to τ + dt? For dt → 0, the chance of acquiring a datapoint y_τ in this time is λ dt. Marginalising over this Poisson stochasticity, we expect one sample with probability λ dt, two samples with probability (λ dt)², and so on. So the mean after dt is expected to be

\mu_{\tau+dt} = \lambda\,dt\,(k_0, k_\tau)\begin{pmatrix} K_0 & \xi_\tau \\ \xi_\tau^\top & \kappa_\tau \end{pmatrix}^{-1}\begin{pmatrix} y_0 \\ y_\tau \end{pmatrix} + (1 - \lambda\,dt)\,k_0^\top K_0^{-1} y_0 + O((\lambda\,dt)^2)   (6)

where we have defined the scalar k_τ = k(s*, s_τ), the vector ξ_τ with elements ξ_{τ,i} = k(s_i, s_τ), and the scalar κ_τ = k(s_τ, s_τ) + σ_q². Algebraic re-formulation yields

\mu_{\tau+dt} = k_0^\top K_0^{-1} y_0 + \lambda\,(k_\tau - k_0^\top K_0^{-1}\xi_\tau)(\kappa_\tau - \xi_\tau^\top K_0^{-1}\xi_\tau)^{-1}(y_\tau - \xi_\tau^\top K_0^{-1} y_0)\,dt.   (7)

Note that ξ_τ^⊤ K_0^{-1} y_0 = μ(s_τ), the mean prediction at s_τ, and κ_τ − ξ_τ^⊤ K_0^{-1} ξ_τ = σ_q² + Σ(s_τ, s_τ), the marginal variance there. Hence, we can define scalars Σ̄, σ̄ and write

\frac{y_\tau - \xi_\tau^\top K_0^{-1} y_0}{(\kappa_\tau - \xi_\tau^\top K_0^{-1}\xi_\tau)^{1/2}} = \frac{[\Sigma^{1/2}\Omega](s_\tau) + \sigma_q\,\omega_\tau}{[\Sigma(s_\tau, s_\tau) + \sigma_q^2]^{1/2}} \equiv \bar\Sigma^{1/2}\Omega + \bar\sigma_\tau\,\omega, \quad \omega \sim N(0, 1).   (8)

So the change to the mean consists of a deterministic but uncertain change whose effects accumulate linearly in time, and a stochastic change, caused by the independent noise process, whose variance accumulates linearly in time (in truth, these two points are considerably subtler; a detailed proof is left out for lack of space).
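Equation (5) is standard GP regression, and the rank-one covariance shrinkage behind Eqs. (7)-(10) can be checked on a discretisation. A hedged NumPy sketch (the training points, noise level, and grid below are invented for illustration):

```python
import numpy as np

def se_kernel(a, b, theta=1.0, length=1.0):
    d = a[:, None] - b[None, :]
    return theta**2 * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_test, sigma_q=0.1):
    """Posterior mean and covariance, as in Eq. (5)."""
    K0 = se_kernel(x_train, x_train) + sigma_q**2 * np.eye(len(x_train))
    k_star = se_kernel(x_test, x_train)              # co-vectors k(s*, S0)
    mu = k_star @ np.linalg.solve(K0, y_train)       # k0^T K0^{-1} y0
    Sigma = se_kernel(x_test, x_test) - k_star @ np.linalg.solve(K0, k_star.T)
    return mu, Sigma

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.array([0.0, 1.0, 0.0])
x_test = np.linspace(-1.0, 3.0, 9)
mu, Sigma = gp_posterior(x_train, y_train, x_test)

# Conditioning on one further noisy point at s_tau shrinks the covariance by a
# rank-one outer product built from the innovation function, cf. Eqs. (7)-(10).
sigma_q = 0.1
i_tau = 4                                            # grid index playing the role of s_tau
L = Sigma[:, i_tau] / np.sqrt(Sigma[i_tau, i_tau] + sigma_q**2)
Sigma_after = Sigma - np.outer(L, L)                 # full-observation analogue of Eq. (10)
```

The diagonal of `Sigma_after` is everywhere no larger than that of `Sigma`: incorporating a measurement can only reduce (marginal) uncertainty, which is the deterministic part of the learning dynamics.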
We use the Wiener [16] measure dω to write

d\mu_{s_\tau}(s^*) = \lambda\,\frac{k_\tau - k_0^\top K_0^{-1}\xi_\tau}{(\kappa_\tau - \xi_\tau^\top K_0^{-1}\xi_\tau)^{1/2}}\,\big[\bar\Sigma^{1/2}\Omega\,dt + \bar\sigma_\tau\,d\omega\big] \equiv \lambda\,L_{s_\tau}(s^*)\,\big[\bar\Sigma^{1/2}\Omega\,dt + \bar\sigma_\tau\,d\omega\big]   (9)

where we have implicitly defined the innovation function L. Note that L is a function of both s* and s_τ. A similar argument finds the change of the covariance function to be the deterministic rate

d\Sigma_{s_\tau}(s_i^*, s_j^*) = -\lambda\,L_{s_\tau}(s_i^*)\,L_{s_\tau}(s_j^*)\,dt.   (10)

So the dynamics of learning consist of a deterministic change to the covariance, and both deterministic and stochastic changes to the mean, both of which are samples of Gaussian processes with covariance function proportional to LL^⊤. This separation is a fundamental characteristic of GPs (it is the nonparametric version of a more straightforward notion for finite-dimensional Gaussian beliefs, for data with known noise magnitude). We introduce the belief-augmented space H containing states z(τ) ≡ [x(τ), τ, μ_τ^q(s), μ_τ^{f_1}, ..., μ_τ^{f_D}, Σ_τ^q(s, s'), Σ_τ^{f_1}, ..., Σ_τ^{f_D}]. Since the means and covariances are functions, H is infinite-dimensional. Under our beliefs, z(τ) obeys a stochastic differential equation of the form

dz = [A(z) + B(z)u + C(z)\Omega]\,dt + D(z)\,d\omega   (11)

with free dynamics A, controlled dynamics Bu, uncertainty operator C, and noise operator D:

A = [\mu_f^\tau(z_x, z_t),\ 1,\ 0, \dots, 0,\ -\lambda L_q L_q^\top,\ -\lambda L_{f_1} L_{f_1}^\top, \dots, -\lambda L_{f_D} L_{f_D}^\top]; \quad B = [g(s^*), 0, 0, \dots]   (12)

C = \mathrm{diag}(\Sigma_{f}^{1/2},\ 0,\ \lambda L_q \bar\Sigma_q^{1/2},\ \lambda L_{f_1}\bar\Sigma_{f_1}^{1/2}, \dots, \lambda L_{f_D}\bar\Sigma_{f_D}^{1/2},\ 0, \dots, 0); \quad D = \mathrm{diag}(0,\ 0,\ \lambda L_q\bar\sigma_q,\ \lambda L_{f_1}\bar\sigma_{f_1}, \dots, \lambda L_{f_D}\bar\sigma_{f_D},\ 0, \dots, 0)   (13)

The value (the expected cost to go) of any state s* is given by the Hamilton-Jacobi-Bellman equation, which follows from Bellman's principle and a first-order expansion, using Eq.
(4):

v(z_\tau) = \min_u \int\!\!\int \left[ \left( \mu_q(s_\tau) + \Sigma_q^{1/2}\Omega_q + \sigma_q\,\omega_q + \tfrac{1}{2}\,u^\top R^{-1} u \right) dt + v(z_{\tau+dt}) \right] d\omega\,d\Omega
\gamma^{-1} v(z_\tau) = \min_u \int \left[ \mu_q^\tau + \Sigma_q^{1/2}\Omega_q + \tfrac{1}{2}\,u^\top R^{-1} u + \frac{\partial v}{\partial t} + [A + Bu + C\Omega]^\top \nabla v + \tfrac{1}{2}\,\mathrm{tr}[D^\top(\nabla^2 v)D] \right] d\Omega   (14)

Integration over ω can be performed with ease, and removes the stochasticity from the problem; the uncertainty over Ω is a lot more challenging. Because the distribution over future losses is correlated through space and time, v and ∇²v are functions of Ω, and the integral is nontrivial. But there are some obvious approximate approaches. For example, if we (inexactly) swap integration and minimisation, draw samples Ω_i and solve for the value for each sample, we get an "average optimal controller". This over-estimates the actual sum of future rewards by assuming the controller has access to the true system. It has the potential advantage of considering the actual optimal controller for every possible system, and the disadvantage that the average of optima need not be optimal for any actual solution. On the other hand, if we ignore the correlation between Ω and v, we can integrate (17) locally, all terms in Ω drop out, and we are left with an "optimal average controller", which assumes that the system locally follows its average (mean) dynamics. This cheaper strategy was adopted in the following. Note that it is myopic, but not greedy in a simplistic sense: it does take the effect of learning into account. It amounts to a "global one-step look-ahead". One could imagine extensions that consider the influence of Ω on v to a higher order, but these will be left for future work. Under this first-order approximation, analytic minimisation over u can be performed in closed form, and yields

u(z) = -R\,B(z)^\top \nabla v(z) = -R\,g(x,t)^\top \nabla_x v(z).   (15)

The optimal Hamilton-Jacobi-Bellman equation is then

\gamma^{-1} v(z) = \mu_q^\tau + A^\top \nabla v(z) - \tfrac{1}{2}\,[\nabla v(z)]^\top B R B^\top \nabla v(z) + \tfrac{1}{2}\,\mathrm{tr}[D^\top(\nabla^2 v)D].   (16)

A more explicit form emerges upon re-inserting the definitions of Eq. (12) into Eq.
(16):

\gamma^{-1} v(z) = \mu_q^\tau + \mu_f^\tau(z_x, z_t)^\top \nabla_x v(z) + \partial_t v(z)   [free drift cost]
\quad - \tfrac{1}{2}\,[\nabla_x v(z)]^\top g(z_x, z_t)\,R\,g(z_x, z_t)^\top \nabla_x v(z)   [control benefit]
\quad - \lambda \sum_{c = q, f_1, \dots, f_D} L_c^\top (\nabla_{\Sigma_c} v(z))\,L_c   [exploration bonus]
\quad + \tfrac{1}{2}\,\lambda^2 \sum_d \bar\sigma_{f_d}^2\, L_{f_d}^\top (\nabla^2_{\mu_{f_d}} v(z))\,L_{f_d}   [diffusion cost]   (17)

Equation (17) is the central result: given Gaussian priors on nonlinear control-affine dynamic systems, up to a first-order approximation, optimal reinforcement learning is described by an infinite-dimensional second-order partial differential equation. It can be interpreted as follows (labels in the equation; note the negative signs of "beneficial" terms): the value of a state comprises the immediate utility rate; the effect of the free drift through space-time and the benefit of optimal control; an exploration bonus of learning; and a diffusion cost engendered by the measurement noise. The first two lines of the right-hand side describe effects from the phase space-time subspace of the augmented space, while the last line describes effects from the belief part of the augmented space. The former will be called exploitation terms, the latter exploration terms, for the following reason: if the first two lines dominate the right-hand side of Equation (17) in absolute size, then future losses are governed by the physical sub-space, caused by exploiting knowledge to control the physical system. On the other hand, if the last line dominates the value function, exploration is more important than exploitation: the algorithm controls the physical space to increase knowledge. To my knowledge, this is the first differential statement about reinforcement learning's two objectives. Finally, note the role of the sampling rate λ: if λ is very low, exploration is useless over the discount horizon.
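Stepping back to Eq. (15): the closed-form control minimises the u-dependent part of the right-hand side, ½u^⊤R^{-1}u + (gu)^⊤∇v. A toy finite-dimensional stand-in (all matrices below are illustrative placeholders, not the paper's quantities):

```python
import numpy as np

R = np.diag([0.5, 2.0])                  # control cost scaling (illustrative)
g = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.3, -0.2]])              # control gain: 3-dim state, 2-dim control
grad_v = np.array([0.4, -1.0, 0.2])      # value gradient at the current state

u_star = -R @ g.T @ grad_v               # closed-form optimal control, Eq. (15)

def h(u):
    """u-dependent part of the Hamiltonian: control cost plus controlled drift."""
    return 0.5 * u @ np.linalg.inv(R) @ u + (g @ u) @ grad_v
```

Since R^{-1} is positive definite, `h` is strictly convex in u and `u_star` is its unique stationary point, which the assertions below confirm numerically.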
Even after these approximations, solving Equation (17) for v remains nontrivial, for two reasons. First, although the vector product notation is pleasingly compact, the mean and covariance functions are of course infinite-dimensional, and what look like straightforward inner vector products are in fact integrals. For example, the average exploration bonus for the loss, writ large, reads

-\lambda\,L_q^\top (\nabla_{\Sigma_q} v)\,L_q = -\lambda \int\!\!\int_K L^{(q)}_{s_\tau}(s_i^*)\,L^{(q)}_{s_\tau}(s_j^*)\,\frac{\partial v(z)}{\partial \Sigma(s_i^*, s_j^*)}\,ds_i^*\,ds_j^*   (18)

(note that this object remains a function of the state s_τ). For general kernels k, these integrals may only be solved numerically. However, for at least one specific choice of kernel (square-exponentials) and parametric Ansatz, the required integrals can be solved in closed form. This analytic structure is so interesting, and the square-exponential kernel so widely used, that the "numerical" part of the paper (Section 4) will restrict the choice of kernel to this class. The other problem, of course, is that Equation (17) is a nontrivial differential equation. Section 4 presents one initial attempt at a numerical solution that should not be mistaken for a definitive answer. Despite all this, Eq. (17) arguably constitutes a useful gain for Bayesian reinforcement learning: it replaces the intractable definition of the value in terms of future trajectories with a differential equation. This raises hope for new approaches to reinforcement learning, based on numerical analysis rather than sampling.

Digression: Relaxing Some Assumptions

This paper only applies to the specific problem class of Section 2. Any generalisations and extensions are future work, and I do not claim to solve them. But it is instructive to consider some easier extensions, and some harder ones. For example, it is intractable to simultaneously learn both g and f nonparametrically if only the actual transitions are observed, because the beliefs over the two functions become infinitely dependent when conditioned on data.
But if the belief on either g or f is parametric (e.g. a general linear model), a joint belief on g and f is tractable [see 15, §2.7], in fact straightforward. Both the quadratic control cost ∝ u^⊤ R^{-1} u and the control-affine form (g(x,t)u) are relaxable assumptions; other parametric forms are possible, as long as they allow for analytic optimization of Eq. (14). On the question of learning the kernels for Gaussian process regression on q and f or g, it is clear that standard ways of inferring kernels [15, 18] can be used without complication, but that they are not covered by the notion of optimal learning as addressed here.

4 Numerically Solving the Hamilton-Jacobi-Bellman Equation

Solving Equation (16) is principally a problem of numerical analysis, and a battery of numerical methods may be considered. This section reports on one specific Ansatz, a Galerkin-type projection analogous to the one used in [12]. For this we break with the generality of previous sections and assume that the kernels k are given by square exponentials

k(a, b) = k_{SE}(a, b; \theta, S) = \theta^2 \exp\!\left(-\tfrac{1}{2}\,(a - b)^\top S^{-1} (a - b)\right)

with parameters θ, S. As discussed above, we approximate by setting Ω = 0. We find an approximate solution through a factorizing parametric Ansatz: let the value of any point z ∈ H in the belief space be given through a set of parameters w and some nonlinear functionals φ, such that their contributions separate over phase space, mean, and covariance functions:

v(z) = \sum_{e = x, \Sigma_q, \mu_q, \Sigma_f, \mu_f} \phi_e(z_e)^\top w_e \quad\text{with}\quad \phi_e, w_e \in \mathbb{R}^{N_e}   (19)

This projection is obviously restrictive, but it should be compared to the use of radial basis functions for function approximation, a similarly restrictive framework widely used in reinforcement learning. The functionals φ have to be chosen conducive to the form of Eq. (17).
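The phase-space part of the Ansatz (19) is just a weighted sum of square-exponential features. A small sketch (the centers, weights, and length-scale matrix below are placeholders chosen for illustration):

```python
import numpy as np

def k_se(a, b, theta, S_inv):
    """Square-exponential kernel of Sec. 4: theta^2 exp(-0.5 (a-b)^T S^{-1} (a-b))."""
    d = a - b
    return theta**2 * np.exp(-0.5 * d @ S_inv @ d)

# Phase-space contribution of the Ansatz (19): v_x(s) = sum_a w_a * k_se(s, s_a).
centers = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])  # feature locations s_a
w = np.array([0.3, -0.1, 0.7])                            # weights w_x
S_inv = np.linalg.inv(np.diag([1.0, 4.0]))                # length-scale matrix S

def v_x(s):
    return sum(wa * k_se(s, sa, 1.0, S_inv) for wa, sa in zip(w, centers))
```

The belief-space functionals (21)-(22) have the same "feature times weight" structure, except that each feature is an integral over K rather than a pointwise kernel evaluation.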
For square exponential kernels, one convenient choice is

\phi^a_s(z_s) = k(s_z, s_a; \theta_a, S_a)   (20)
\phi^b_\Sigma(z_\Sigma) = \int\!\!\int_K [\Sigma_z(s_i^*, s_j^*) - k(s_i^*, s_j^*)]\, k(s_i^*, s_b; \theta_b, S_b)\, k(s_j^*, s_b; \theta_b, S_b)\, ds_i^*\, ds_j^*   (21)
\phi^c_\mu(z_\mu) = \int\!\!\int_K \mu_z(s_i^*)\,\mu_z(s_j^*)\, k(s_i^*, s_c; \theta_c, S_c)\, k(s_j^*, s_c; \theta_c, S_c)\, ds_i^*\, ds_j^*   (22)

(the subtracted term in the first integral serves only numerical purposes). With this choice, the integrals of Equation (17) can be solved analytically (solutions left out due to space constraints). The approximate Ansatz turns Eq. (17) into an algebraic equation, quadratic in w_x and linear in all other w_e:

\tfrac{1}{2}\,w_x^\top \Psi(z_x)\, w_x - q(z_x) + \sum_{e = x, \mu_q, \Sigma_q, \mu_f, \Sigma_f} \Xi_e(z_e)^\top w_e = 0   (23)

using co-vectors Ξ and a matrix Ψ with elements

\Xi^a_x(z_s) = \gamma^{-1}\phi^a_s(z_s) - f(z_x)^\top \nabla_x \phi^a_s(z_s) - \partial_t \phi^a_s(z_s)
\Xi^a_\Sigma(z_\Sigma) = \gamma^{-1}\phi^a_\Sigma(z_\Sigma) + \lambda \int\!\!\int_K L_{s_\tau}(s_i^*)\,L_{s_\tau}(s_j^*)\,\frac{\partial \phi^a_\Sigma(z_\Sigma)}{\partial \Sigma_z(s_i^*, s_j^*)}\,ds_i^*\,ds_j^*
\Xi^a_\mu(z_\mu) = \gamma^{-1}\phi^a_\mu(z_\mu) - \tfrac{1}{2}\,\lambda^2 \bar\sigma^2_{s_\tau} \int\!\!\int_K L_{s_\tau}(s_i^*)\,L_{s_\tau}(s_j^*)\,\frac{\partial^2 \phi^a_\mu(z_\mu)}{\partial \mu_z(s_i^*)\,\partial \mu_z(s_j^*)}\,ds_i^*\,ds_j^*
\Psi(z)_{k\ell} = [\nabla_x \phi^k_s(z)]^\top g(z_x)\,R\,g(z_x)^\top [\nabla_x \phi^\ell_s(z)]   (24)

Note that Ξ_μ and Ξ_Σ are both functions of the physical state, through s_τ. It is through this functional dependency that the value of information is associated with the physical phase space-time. To solve for w, we simply choose a number of evaluation points z_eval sufficient to constrain the resulting system of quadratic equations, and then find the least-squares solution w_opt by function minimisation, using standard methods such as Levenberg-Marquardt [19]. A disadvantage of this approach is that it has a number of degrees of freedom Θ, such as the kernel parameters and the number and locations x_a of the feature functionals. Our experiments (Section 5) suggest that it is nevertheless possible to get interesting results simply by choosing these parameters heuristically.
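The least-squares solve for w can be illustrated on a synthetic instance of Eq. (23). The damped Gauss-Newton loop below is a stand-in for the Levenberg-Marquardt routine [19], and the feature matrices are random placeholders rather than the paper's analytic integrals:

```python
import numpy as np

# Synthetic instance of Eq. (23): at M evaluation points, drive the residuals
#   r_m(w) = 0.5 w^T Psi_m w - q_m + Xi_m^T w
# to zero in the least-squares sense.
rng = np.random.default_rng(0)
M, N = 12, 3                                         # evaluation points, parameters
Psi = rng.standard_normal((M, N, N))
Psi = 0.5 * (Psi + np.transpose(Psi, (0, 2, 1)))     # symmetrise each Psi_m
Xi = rng.standard_normal((M, N))
w_true = rng.standard_normal(N)
q = 0.5 * np.einsum('i,mij,j->m', w_true, Psi, w_true) + Xi @ w_true  # consistent targets

def residual(w):
    return 0.5 * np.einsum('i,mij,j->m', w, Psi, w) - q + Xi @ w

def jacobian(w):
    return np.einsum('mij,j->mi', Psi, w) + Xi       # dr_m/dw = Psi_m w + Xi_m

w = np.zeros(N)
best_w, best_norm = w, np.linalg.norm(residual(w))
for _ in range(100):
    step, *_ = np.linalg.lstsq(jacobian(w), -residual(w), rcond=None)
    w = w + 0.5 * step                               # damped Gauss-Newton step
    r = np.linalg.norm(residual(w))
    if r < best_norm:
        best_w, best_norm = w, r
```

Because the targets `q` are generated from `w_true`, the system is consistent and `w_true` zeroes every residual exactly; the iteration then only has to find some root of an overdetermined quadratic system, which is the same computational shape as the paper's solve.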
5 Experiments

5.1 Illustrative Experiment on an Artificial Environment

As a simple example system with a one-dimensional state space, f, q were sampled from the model described in Section 2, and g was set to the unit function. The state space was tiled regularly, in a bounded region, with 231 square exponential ("radial") basis functions (Equation 20), initially all with weight w_x^i = 0. For the information terms, only a single basis function was used for each term (i.e. one single φ_{Σ_q}, one single φ_{μ_q}, and equally for f, all with very large length scales S, covering the entire region of interest). As pointed out above, this does not imply a trivial structure for these terms, because of the functional dependency on L_{s_τ}. Five times the number of parameters, i.e. N_eval = 1175 evaluation points z_eval, were sampled, at each time step, uniformly over the same region. It is not intuitively clear whether each z_e should have its own belief (i.e. whether the points must cover the belief space as well as the phase space), but anecdotal evidence from the experiments suggests that it suffices to use the current beliefs for all evaluation points. A more comprehensive evaluation of such aspects will be the subject of a future paper. The discount factor was set to γ = 50 s, the sampling rate to λ = 2/s, and the control cost to 10 m²/($ s). Value and optimal control were evaluated at time steps of δt = 1/λ = 0.5 s.

[Figure 1: State after 50 time steps, plotted over phase space-time. Top left: μ_q (blue is good). The belief over f is not shown, but has similar structure. Top right: value estimate v at the current belief; compare to the next two panels to note that the approximation is relatively coarse. Bottom left: exploration terms. Bottom right: exploitation terms. At its current state (black diamond), the system is in the process of switching from exploitation to exploration (the blue region in the bottom right panel is roughly cancelled by the red, forward cone in the bottom left one).]

Figure 1 shows the situation 50 s after initialisation. The most noteworthy aspect is the nontrivial structure of the exploration and exploitation terms. Despite the simplistic parameterisation of the corresponding functionals, their functional dependence on s_τ induces a complex shape. The system constantly balances exploration and exploitation, and the optimal balance depends nontrivially on location, time, and the actual value (as opposed to only the uncertainty) of accumulated knowledge. This is an important insight that casts doubt on the usefulness of simple, local exploration boni, used in many reinforcement learning algorithms. Secondly, note that the system's trajectory does not necessarily follow what would be the optimal path under full information. The value estimate reflects this, by assigning low (good) value to regions behind the system's trajectory. This amounts to a sense of "remorse": if the learner had known about these regions earlier, it would have strived to reach them. But this is not a sign of sub-optimality: remember that the value is defined on the augmented space. The plots in Figure 1 are merely a slice through that space at some level set in the belief space.

5.2 Comparative Experiment: The Furuta Pendulum

The cart-and-pole system is an under-actuated problem widely studied in reinforcement learning. For variation, this experiment uses a cylindrical version, the pendulum on the rotating arm [20]. The task is to swing up the pendulum from the lower resting point. The table in Figure 2 compares the average loss of a controller with access to the true f, g, q, but otherwise using Algorithm 1, to that of an ε-greedy TD(λ) learner with linear function approximation, Simpkins et al.'s [12] Kalman method, and the Gaussian process learning controller (Fig. 2).
The linear function approximation of TD(λ) used the same radial basis functions as the three other methods. None of these methods is free of assumptions: note that the sampling frequency influences TD in nontrivial ways rarely studied (for example through the coarseness of the ε-greedy policy). The parameters were set to γ = 5 s, λ = 50/s. Note that reinforcement learning experiments often quote total accumulated loss, which differs from the discounted task posed to the learner; Figure 2 reports actual discounted losses.

[Figure 2: The Furuta pendulum system: a pendulum of length ℓ2 is attached to a rotatable arm of length ℓ1. The control input is the torque u applied to the arm. Right: cost to go achieved by different methods (lower is better). Error measures are one standard deviation over five experiments.]

Method / cumulative loss:
Full Information (baseline): 4.4 ± 0.3
TD(λ): 6.401 ± 0.001
Kalman filter Optimal Learner: 6.408 ± 0.001
Gaussian process optimal learner: 4.6 ± 1.4

The GP method clearly outperforms the other two learners, which barely explore. Interestingly, none of the tested methods, not even the informed controller, achieves a stable controlled balance, although the GP learner does swing up the pendulum. This is due to the random, non-optimal location of basis functions, which means resolution is not necessarily available where it is needed (in regions of high curvature of the value function), and demonstrates a need for better solution methods for Eq. (17). There is of course a large number of other algorithms to potentially compare to, and these results are anything but exhaustive. They should not be misunderstood as a critique of any other method. But they highlight the need for units of measure on every quantity, and show how hard optimal exploration and exploitation truly is. Note that, for time-varying or discounted problems, there is no "conservative" option that could be adopted in place of the Bayesian answer.
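For reference, the ε-greedy TD(λ) baseline uses the standard eligibility-trace update with linear value approximation. A minimal sketch on an invented two-state chain (illustrative only, not the experimental code):

```python
import numpy as np

def td_lambda_step(w, e, phi_s, phi_next, reward, gamma_disc, lam, alpha):
    """One TD(lambda) update with linear value approximation v(s) = w^T phi(s).

    w: weight vector, e: eligibility trace, phi_*: feature vectors.
    Returns the updated (w, e)."""
    delta = reward + gamma_disc * (w @ phi_next) - (w @ phi_s)  # TD error
    e = gamma_disc * lam * e + phi_s                            # accumulating trace
    return w + alpha * delta * e, e

# Tiny deterministic 2-state chain: reward 1 on the 0 -> 1 transition, 0 otherwise.
w, e = np.zeros(2), np.zeros(2)
phi = np.eye(2)                                                 # one-hot features
for _ in range(200):
    w, e = td_lambda_step(w, e, phi[0], phi[1], 1.0, 0.9, 0.8, 0.1)
    w, e = td_lambda_step(w, e, phi[1], phi[0], 0.0, 0.9, 0.8, 0.1)
```

With one-hot features this is tabular TD(λ); the learned values should rank state 0 (which collects the reward next) above state 1, consistent with the discounted fixed point v_0 = 1 + γ v_1, v_1 = γ v_0.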
6 Conclusion Gaussian process priors provide a nontrivial class of reinforcement learning problems for which optimal reinforcement learning reduces to solving differential equations. Of course, this fact alone does not make the problem easier, as solving nonlinear differential equations is in general intractable. However, the ubiquity of differential descriptions in other fields raises hope that this insight opens new approaches to reinforcement learning. For intuition on how such solutions might work, one specific approximation was presented, using functionals to reduce the problem to finite least-squares parameter estimation. The critical reader will have noted how central the prior is for the arguments in Section 3: The dynamics of the learning process are predictions of future data, thus inherently determined exclusively by prior assumptions. One may find this unappealing, but there is no escape from it. Minimizing future loss requires predicting future loss, and predictions are always in danger of falling victim to incorrect assumptions. A finite initial identification phase may mitigate this problem by replacing prior with posterior uncertainty – but even then, predictions and decisions will depend on the model. The results of this paper raise new questions, theoretical and applied. The most pressing questions concern better solution methods for Eq. (14), in particular better means for taking the expectation over the uncertain dynamics to more than first order. There are also obvious probabilistic issues: Are there other classes of priors that allow similar treatments? (Note some conceptual similarities between this work and the BEETLE algorithm [4]). To what extent can approximate inference methods – widely studied in combination with Gaussian process regression – be used to broaden the utility of these results? 
Acknowledgments

The author wishes to express his gratitude to Carl Rasmussen, Jan Peters, Zoubin Ghahramani, Peter Dayan, and an anonymous reviewer, whose thoughtful comments uncovered several errors and crucially improved this paper.

References

[1] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998.
[2] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
[3] M.O.G. Duff. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts, Amherst, 2002.
[4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704, 2006.
[5] R. Dearden, N. Friedman, and D. Andre. Model based Bayesian exploration. In Uncertainty in Artificial Intelligence, pages 150–159, 1999.
[6] M. Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950, 2000.
[7] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In International Conference on Machine Learning, pages 956–963, 2005.
[8] J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Uncertainty in Artificial Intelligence, 2009.
[9] J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning. Morgan Kaufmann, 2009.
[10] E. Todorov. Linearly-solvable Markov decision problems. Advances in Neural Information Processing Systems, 19, 2007.
[11] H.J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning.
In 9th Granada Seminar on Computational Physics: Computational and Mathematical Modeling of Cooperative Behavior in Neural Systems, pages 149–181, 2007.
[12] A. Simpkins, R. de Callafon, and E. Todorov. Optimal trade-off between exploration and exploitation. In American Control Conference, 2008, pages 33–38, 2008.
[13] I. Fantoni and R. Lozano. Non-linear Control for Underactuated Mechanical Systems. Springer, 2002.
[14] A.A. Feldbaum. Dual control theory. Automation and Remote Control, 21(9):874–880, April 1961.
[15] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[16] N. Wiener. Differential space. Journal of Mathematical Physics, 2:131–174, 1923.
[17] T. Kailath. An innovations approach to least-squares estimation, Part I: Linear filtering in additive white noise. IEEE Transactions on Automatic Control, 13(6):646–655, 1968.
[18] I. Murray and R.P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. arXiv:1006.0868, 2010.
[19] D.W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
[20] K. Furuta, M. Yamakita, and S. Kobayashi. Swing-up control of inverted pendulum using pseudo-state feedback. Journal of Systems and Control Engineering, 206(6):263–269, 1992.

5 0.081664845 131 nips-2011-Inference in continuous-time change-point models

Author: Florian Stimberg, Manfred Opper, Guido Sanguinetti, Andreas Ruttor

Abstract: We consider the problem of Bayesian inference for continuous-time multi-stable stochastic systems which can change both their diffusion and drift parameters at discrete times. We propose exact inference and sampling methodologies for two specific cases where the discontinuous dynamics is given by a Poisson process and a two-state Markovian switch. We test the methodology on simulated data, and apply it to two real data sets in finance and systems biology. Our experimental results show that the approach leads to valid inferences and non-trivial insights. 1

6 0.074549213 229 nips-2011-Query-Aware MCMC

7 0.067744628 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

8 0.066418685 302 nips-2011-Variational Learning for Recurrent Spiking Networks

9 0.065253645 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

10 0.063974075 221 nips-2011-Priors over Recurrent Continuous Time Processes

11 0.062965453 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

12 0.058425233 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

13 0.057611305 275 nips-2011-Structured Learning for Cell Tracking

14 0.053554479 219 nips-2011-Predicting response time and error rates in visual search

15 0.053536274 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

16 0.053298082 249 nips-2011-Sequence learning with hidden units in spiking neural networks

17 0.053149451 226 nips-2011-Projection onto A Nonnegative Max-Heap

18 0.05047198 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure

19 0.050344273 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks

20 0.048933491 22 nips-2011-Active Ranking using Pairwise Comparisons


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.152), (1, 0.004), (2, 0.051), (3, 0.006), (4, -0.023), (5, -0.086), (6, -0.051), (7, -0.077), (8, -0.015), (9, -0.003), (10, -0.041), (11, -0.076), (12, 0.002), (13, 0.043), (14, 0.045), (15, -0.016), (16, 0.003), (17, -0.018), (18, -0.018), (19, 0.012), (20, 0.036), (21, 0.035), (22, 0.058), (23, -0.1), (24, -0.092), (25, 0.051), (26, -0.098), (27, 0.088), (28, 0.072), (29, -0.022), (30, 0.014), (31, 0.018), (32, -0.035), (33, -0.008), (34, -0.04), (35, 0.053), (36, 0.032), (37, -0.016), (38, 0.038), (39, 0.046), (40, 0.059), (41, 0.058), (42, -0.068), (43, 0.099), (44, 0.018), (45, 0.178), (46, 0.092), (47, 0.028), (48, -0.019), (49, -0.092)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92730671 8 nips-2011-A Model for Temporal Dependencies in Event Streams

Author: Asela Gunawardana, Christopher Meek, Puyang Xu

Abstract: We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events. 1

2 0.63531315 101 nips-2011-Gaussian process modulated renewal processes

Author: Yee W. Teh, Vinayak Rao

Abstract: Renewal processes are generalizations of the Poisson process on the real line whose intervals are drawn i.i.d. from some distribution. Modulated renewal processes allow these interevent distributions to vary with time, allowing the introduction of nonstationarity. In this work, we take a nonparametric Bayesian approach, modelling this nonstationarity with a Gaussian process. Our approach is based on the idea of uniformization, which allows us to draw exact samples from an otherwise intractable distribution. We develop a novel and efficient MCMC sampler for posterior inference. In our experiments, we test these on a number of synthetic and real datasets. 1

3 0.62558097 221 nips-2011-Priors over Recurrent Continuous Time Processes

Author: Ardavan Saeedi, Alexandre Bouchard-côté

Abstract: We introduce the Gamma-Exponential Process (GEP), a prior over a large family of continuous time stochastic processes. A hierarchical version of this prior (HGEP; the Hierarchical GEP) yields a useful model for analyzing complex time series. Models based on HGEPs display many attractive properties: conjugacy, exchangeability and closed-form predictive distribution for the waiting times, and exact Gibbs updates for the time scale parameters. After establishing these properties, we show how posterior inference can be carried efficiently using Particle MCMC methods [1]. This yields a MCMC algorithm that can resample entire sequences atomically while avoiding the complications of introducing slice and stick auxiliary variables of the beam sampler [2]. We applied our model to the problem of estimating the disease progression in multiple sclerosis [3], and to RNA evolutionary modeling [4]. In both domains, we found that our model outperformed the standard rate matrix estimation approach. 1

4 0.6051203 131 nips-2011-Inference in continuous-time change-point models

Author: Florian Stimberg, Manfred Opper, Guido Sanguinetti, Andreas Ruttor

Abstract: We consider the problem of Bayesian inference for continuous-time multi-stable stochastic systems which can change both their diffusion and drift parameters at discrete times. We propose exact inference and sampling methodologies for two specific cases where the discontinuous dynamics is given by a Poisson process and a two-state Markovian switch. We test the methodology on simulated data, and apply it to two real data sets in finance and systems biology. Our experimental results show that the approach leads to valid inferences and non-trivial insights. 1

5 0.58746457 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

Author: Alex K. Susemihl, Ron Meir, Manfred Opper

Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1

6 0.55364233 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

7 0.54825079 55 nips-2011-Collective Graphical Models

8 0.535312 229 nips-2011-Query-Aware MCMC

9 0.48972756 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes

10 0.47735953 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference

11 0.4690595 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

12 0.45508233 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

13 0.43471047 197 nips-2011-On Tracking The Partition Function

14 0.4268662 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

15 0.42560527 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

16 0.4227927 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing

17 0.3951233 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

18 0.38742912 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression

19 0.38671139 33 nips-2011-An Exact Algorithm for F-Measure Maximization

20 0.38152948 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (4, 0.016), (20, 0.013), (26, 0.031), (31, 0.078), (33, 0.014), (43, 0.062), (45, 0.058), (57, 0.036), (74, 0.026), (83, 0.049), (99, 0.509)]
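The `(topicId, topicWeight)` pairs above form a sparse LDA topic vector for this paper, and the `simValue` scores in the list below are presumably similarities between such vectors. A minimal sketch of one plausible similarity measure (cosine over densified topic vectors); the helper names, the 100-topic vocabulary size, and the second paper's weights are our assumptions, not the site's:

```python
import numpy as np

def topic_vector(pairs, num_topics=100):
    """Expand sparse (topicId, topicWeight) pairs into a dense vector."""
    v = np.zeros(num_topics)
    for topic_id, weight in pairs:
        v[topic_id] = weight
    return v

def cosine_similarity(a, b):
    """Cosine similarity between two topic vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Topic distribution for this paper, exactly as listed above.
paper = topic_vector([(0, 0.018), (4, 0.016), (20, 0.013), (26, 0.031),
                      (31, 0.078), (33, 0.014), (43, 0.062), (45, 0.058),
                      (57, 0.036), (74, 0.026), (83, 0.049), (99, 0.509)])

# A hypothetical second paper dominated by the same topic 99.
other = topic_vector([(31, 0.1), (99, 0.6)])
print(round(cosine_similarity(paper, other), 4))
```

Because this paper's mass is concentrated on topic 99, any other paper heavy on topic 99 scores near 1.0, which is consistent with the very high `simValue` of the top entries below.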

similar papers list:

simIndex simValue paperId paperTitle

1 0.94635463 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

Author: Binbin Lin, Chiyuan Zhang, Xiaofei He

Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Much of the existing work uses the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, recent theoretical work suggests that we should ensure second-order smoothness to achieve faster rates of convergence for semi-supervised regression problems. To achieve this goal, we show that second-order smoothness measures the linearity of the function, and that the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error while requiring its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results demonstrate the effectiveness of our proposed approach.
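The abstract's key computational claim is that the discretized objective reduces to a sparse linear system. That claim can be illustrated generically; the tridiagonal matrix below merely stands in for a discretized regularized operator and is not the paper's actual construction:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

# A sparse, symmetric, diagonally dominant (hence SPD) system A x = b,
# standing in for the discretized regularized least-squares problem.
n = 1000
A = diags([-1.0, 2.5, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

x = spsolve(A, b)                       # direct sparse solve, O(n) here
residual = np.linalg.norm(A @ x - b)    # should be near machine precision
print(residual < 1e-8)
```

The point is the cost profile: a banded sparse system of size n solves in roughly linear time, whereas a dense solve would be cubic, which is why reducing a manifold-regularization objective to such a system is "very efficient" in the abstract's sense.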

2 0.92662346 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern

Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fails to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines.

3 0.90972424 54 nips-2011-Co-regularized Multi-view Spectral Clustering

Author: Abhishek Kumar, Piyush Rai, Hal Daume

Abstract: In many clustering problems, we have access to multiple views of the data, each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit the same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have the same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.
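The co-regularization idea in the abstract above penalizes disagreement between per-view spectral embeddings; one common form of the penalty is (up to constants) the negative trace tr(U Uᵀ V Vᵀ). This sketch shows only that agreement term on a toy two-view graph, not the paper's full alternating optimization:

```python
import numpy as np

def spectral_embedding(W, k):
    """Top-k eigenvectors of the normalized adjacency D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = d_inv_sqrt @ W @ d_inv_sqrt
    _, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    return vecs[:, -k:]             # eigenvectors of the k largest eigenvalues

def disagreement(U, V):
    """One form of the co-regularization penalty: -tr(U U^T V V^T)."""
    return -np.trace(U @ U.T @ V @ V.T)

# Two toy "views" of 6 points with the same 2-cluster block structure.
W1 = np.array([[1, 1, 1, 0, 0, 0],
               [1, 1, 1, 0, 0, 0],
               [1, 1, 1, 0, 0, 0],
               [0, 0, 0, 1, 1, 1],
               [0, 0, 0, 1, 1, 1],
               [0, 0, 0, 1, 1, 1]], dtype=float)
W2 = W1.copy()

U1 = spectral_embedding(W1, 2)
U2 = spectral_embedding(W2, 2)
# Perfectly consistent views give maximal agreement: tr(U U^T V V^T) = k.
print(round(-disagreement(U1, U2), 6))
```

In the full method, this term is added to each view's spectral clustering objective, so the embeddings are pulled toward assigning corresponding points to the same cluster across views.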

same-paper 4 0.89077592 8 nips-2011-A Model for Temporal Dependencies in Event Streams

Author: Asela Gunawardana, Christopher Meek, Puyang Xu

Abstract: We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and an importance sampling algorithm for forecasting future events with these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events.
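The forecasting mechanics described in this abstract rest on importance sampling over event streams: draw candidate event times from a tractable Poisson proposal, then reweight each sampled stream by the likelihood ratio against the target intensity. The sketch below uses a constant-rate proposal and a hand-picked piecewise-constant target intensity; it is a generic illustration of the reweighting step, not the paper's actual Poisson-superposition proposal:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10.0  # forecast horizon

def target_intensity(t):
    """A hand-picked piecewise-constant intensity (PCIM-style)."""
    return 2.0 if t < 5.0 else 1.0

def integrated_target(horizon):
    """Closed-form integral of the piecewise-constant intensity."""
    return 2.0 * min(horizon, 5.0) + 1.0 * max(0.0, horizon - 5.0)

MU = 2.0  # constant proposal rate, chosen >= sup of the target intensity

def sample_poisson_times(rate, horizon, rng):
    """Draw one event stream from a homogeneous Poisson proposal."""
    n = rng.poisson(rate * horizon)
    return np.sort(rng.uniform(0.0, horizon, size=n))

def importance_weight(times):
    """Likelihood ratio: target point process vs. Poisson(MU) proposal."""
    log_w = sum(np.log(target_intensity(t) / MU) for t in times)
    log_w += MU * T - integrated_target(T)
    return np.exp(log_w)

samples = [sample_poisson_times(MU, T, rng) for _ in range(20000)]
weights = np.array([importance_weight(s) for s in samples])
counts = np.array([len(s) for s in samples])
est = (weights * counts).sum() / weights.sum()  # self-normalized estimate
print(round(est, 2))  # should be close to the true mean count, integral of lambda = 15
```

The same reweighting structure supports arbitrary forecast queries (e.g. "probability of at least one event in [8, 10]") by replacing `counts` with the indicator of the event of interest.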

5 0.84082776 270 nips-2011-Statistical Performance of Convex Tensor Decomposition

Author: Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, Hisashi Kashima

Abstract: We analyze the statistical performance of a recently proposed convex tensor decomposition algorithm. Conventionally, tensor decomposition has been formulated as a non-convex optimization problem, which has hindered the analysis of its performance. We show, under some conditions, that the mean squared error of the convex method scales linearly with a quantity we call the normalized rank of the true tensor. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.

6 0.79082167 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

7 0.564533 102 nips-2011-Generalised Coupled Tensor Factorisation

8 0.5519281 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

9 0.54948831 186 nips-2011-Noise Thresholds for Spectral Clustering

10 0.52908075 263 nips-2011-Sparse Manifold Clustering and Embedding

11 0.51712316 29 nips-2011-Algorithms and hardness results for parallel large margin learning

12 0.51683015 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

13 0.51506042 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

14 0.5082798 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

15 0.50356114 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

16 0.4967756 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

17 0.49402145 66 nips-2011-Crowdclustering

18 0.48709327 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

19 0.48516554 72 nips-2011-Distributed Delayed Stochastic Optimization

20 0.48505282 198 nips-2011-On U-processes and clustering performance