nips nips2012 nips2012-60 knowledge-graph by maker-knowledge-mining

60 nips-2012-Bayesian nonparametric models for ranked data


Source: pdf

Author: Francois Caron, Yee W. Teh

Abstract: We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We develop a time-varying extension of our model, and apply it to the New York Times lists of weekly bestselling books. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. [sent-8, score-0.179]

2 Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. [sent-9, score-0.399]

3 We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. [sent-10, score-0.256]

4 We develop a time-varying extension of our model, and apply it to the New York Times lists of weekly bestselling books. [sent-11, score-0.218]

5 in terms of an ordered list of the top-m items, arise in many contexts. [sent-14, score-0.16]

6 For example, in this paper we consider datasets consisting of the top 20 bestselling books as published each week by the New York Times. [sent-15, score-0.138]

7 The Plackett-Luce model [1, 2] is a popular model for modeling such partial rankings of a finite collection of M items. [sent-16, score-0.299]

8 It has found many applications, including choice modeling [3], sport ranking [4], and voting [5]. [sent-17, score-0.153]

9 In the Plackett-Luce model, each item k ∈ [M ] = {1, . [sent-20, score-0.195]

10 , M } is assigned a positive rating parameter wk , which represents the desirability or rating of a product in the case of choice modeling, or the skill of a player in sport rankings. [sent-23, score-0.567]

11 The Plackett-Luce model assumes the following generative story for a top-m list ρ = (ρ1 , . [sent-24, score-0.132]

12 , ρm ) of items ρi ∈ [M ]: At each stage i = 1, . [sent-27, score-0.23]

13 , m, an item is chosen to be the ith item in the list from among the items that have not yet appeared, with the probability that ρi is selected being proportional to its desirability wρi . [sent-30, score-0.92]

14 The overall probability of a given partial ranking ρ is then: $P(\rho) = \prod_{i=1}^{m} \frac{w_{\rho_i}}{\sum_{k=1}^{M} w_k - \sum_{j=1}^{i-1} w_{\rho_j}}$ (1). [sent-31, score-0.185]

15 The denominator in (1) is the sum over all items not yet selected at stage i. [sent-32, score-0.568]
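
As a concrete illustration of the generative story and of equation (1) (this sketch is not from the paper; the NumPy-based helper names are my own), the following Python code samples a top-m list by repeatedly picking among the remaining items with probability proportional to their desirabilities, and evaluates the probability of an observed list under (1).

```python
import numpy as np

def sample_top_m(w, m, rng=None):
    """Sample a top-m list from a finite Plackett-Luce model with weights w."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    remaining = list(range(len(w)))
    ranking = []
    for _ in range(m):
        probs = w[remaining] / w[remaining].sum()
        idx = rng.choice(len(remaining), p=probs)
        ranking.append(remaining.pop(idx))
    return ranking

def top_m_probability(w, ranking):
    """Probability of an observed top-m list under equation (1)."""
    w = np.asarray(w, dtype=float)
    total = w.sum()
    prob = 1.0
    removed = 0.0
    for item in ranking:
        prob *= w[item] / (total - removed)   # denominator excludes items already selected
        removed += w[item]
    return prob
```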

16 In many situations the collection of available items can be very large and potentially unknown. [sent-33, score-0.23]

17 In this case, a nonparametric approach can be sensible, where the pool of items is assumed to be infinite and the model allows for the possibility of items not observed in previous top-m lists to appear in new ones. [sent-34, score-0.732]

18 Our model assumes the existence of an infinite pool of items $\{X_k\}_{k=1}^{\infty}$, each with its own rating parameter, $\{w_k\}_{k=1}^{\infty}$. [sent-37, score-0.304]

19 The probability of a top-m list of items, say $(X_{\rho_1}, \ldots$ [sent-38, score-0.154]

20 Described this way, it is clear that the Plackett-Luce model is basically a partial size-biased permutation of the atoms in G [10], and the existing machinery of random measures and exchangeable random partitions [11] can be brought to bear on our problem. [sent-46, score-0.267]

21 In particular, in Section 2 we will use a gamma process as the prior over the atomic measure G. [sent-47, score-0.466]

22 This is a completely random measure [12] with gamma marginals, such that the corresponding normalized probability measure is a Dirichlet process. [sent-48, score-0.456]

23 We will show that with the introduction of a suitable set of auxiliary variables, we can characterize the posterior law of G given observations of top-m lists distributed according to (2). [sent-49, score-0.394]

24 A simple Gibbs sampler can then be derived to simulate from the posterior distribution. [sent-50, score-0.175]

25 In Section 3 we develop a time-varying extension of our model and derive a simple and effective Gibbs sampler for posterior simulation. [sent-51, score-0.175]

26 In Section 4 we apply our time-varying Bayesian nonparametric Plackett-Luce model to the aforementioned New York Times bestsellers datasets, and conclude in Section 5. [sent-52, score-0.193]

27 2 A Bayesian nonparametric model for partial ranking We start this section by briefly describing a Bayesian approach to inference in finite Plackett-Luce models [9], and taking the infinite limit to arrive at the nonparametric model. [sent-53, score-0.452]

28 This will give good intuitions for how the model operates, before we rederive the same nonparametric model more formally using gamma processes. [sent-54, score-0.417]

29 For notational simplicity we assume that all the partial rankings are length m. [sent-59, score-0.299]

30 1 Finite Plackett-Luce model with gamma prior Suppose we have M choice items, with item k ∈ [M ] having a positive desirability parameter wk . [sent-61, score-0.97]

31 , ρ m ) can be constructed generatively by picking the ith item ρ i at the ith stage for i = 1, . [sent-65, score-0.283]

32 An alternative Thurstonian interpretation, which will be important in the following, is as follows: For each item k let z k ∼ Exp(wk ) be exponentially distributed with rate wk . [sent-69, score-0.533]

33 Thinking of $z_k$ as the arrival time of item k in a race, let $\rho_i$ be the index of the ith item to arrive (the ith smallest value among $(z_k)_{k=1}^{M}$). [sent-70, score-0.58]

34 Unfortunately the posterior distribution of $(z_k)$ given ρ is difficult to compute directly, so we instead consider an alternative parameterization: Let $Z_i = z_{\rho_i} - z_{\rho_{i-1}}$ be the waiting time for the ith item to arrive after the (i−1)th item (with $z_{\rho_0}$ defined to be 0). [sent-73, score-0.562]
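
A small illustrative sketch of the Thurstonian race view and of the waiting-time parameterization described above (not from the paper): arrival times $z_k$ are exponential with rates $w_k$, the first m arrivals form the top-m list, and the increments $Z_i$ are the waiting times between consecutive arrivals.

```python
import numpy as np

def race_top_m(w, m, rng=None):
    """Thurstonian view: z_k ~ Exp(w_k); the m earliest arrivals give the top-m list."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    z = rng.exponential(scale=1.0 / w)              # arrival time of each item
    order = np.argsort(z)[:m]                       # indices of the m smallest arrival times
    arrivals = z[order]
    Z = np.diff(np.concatenate(([0.0], arrivals)))  # waiting times Z_i = z_{rho_i} - z_{rho_{i-1}}
    return order.tolist(), Z
```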

35 Taking a further step, we note that a factorized gamma prior over $(w_k)$ is conjugate to (4), say $w_k \sim \mathrm{Gamma}(\alpha/M, \tau)$ with hyperparameters $\alpha, \tau > 0$. [sent-75, score-0.671]

36 In this case the parameter updates are of the form $w_k \mid (\rho_\ell), (Z_{\ell i}), (w_{k'})_{k' \neq k} \sim \mathrm{Gamma}\big(\tfrac{\alpha}{M} + n_k,\; \tau + \sum_{\ell=1}^{L} \sum_{i=1}^{m} \delta_{\ell i k} Z_{\ell i}\big)$ (5), where $n_k$ is the number of occurrences of item k among the observed partial rankings, and $\delta_{\ell i k} = 0$ if there is a j < i with $\rho_{\ell j} = k$ and 1 otherwise. [sent-78, score-1.063]
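
The resulting Gibbs sweep for the finite model can be sketched as below; this is an illustrative reconstruction based on the conjugacy in (5), with hypothetical function and variable names rather than the authors' code. Each sweep samples the waiting times given the current weights (they are exponential, cf. (8)), then resamples every weight from its gamma conditional.

```python
import numpy as np

def gibbs_sweep(w, rankings, alpha, tau, rng=None):
    """One Gibbs sweep for the finite Plackett-Luce model with Gamma(alpha/M, tau) priors.
    `rankings` is a list of top-m lists, each a list of distinct item indices."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    M = len(w)
    total = w.sum()
    n = np.zeros(M)                    # n_k: number of appearances of item k
    rate = np.full(M, float(tau))      # accumulates tau + sum_{l,i} delta_{lik} Z_{li}
    for rho in rankings:
        # Waiting times: rate at stage i is the total mass minus the mass of earlier picks
        rates = total - np.concatenate(([0.0], np.cumsum(w[rho[:-1]])))
        Z = rng.exponential(scale=1.0 / rates)
        seen = np.zeros(M, dtype=bool)
        for item, z in zip(rho, Z):
            n[item] += 1
            rate[~seen] += z           # delta_{lik} = 1 iff k has not yet appeared in this list
            seen[item] = True
    # Conjugate gamma updates of the weights, cf. equation (5)
    return rng.gamma(shape=alpha / M + n, scale=1.0 / rate)
```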

37 A nonparametric Plackett-Luce model can now be easily derived by taking the limit as the number of choice items M → ∞. [sent-80, score-0.386]

38 For those items k that have appeared among the observed partial rankings, the limiting conditional distribution (5) is well defined since $n_k > 0$. [sent-81, score-0.514]

39 For items that did not appear in the observations, (5) becomes degenerate at 0. [sent-82, score-0.25]

40 This nonparametric model allows us to estimate the probability of seeing new items appearing in future partial rankings in a consistent manner. [sent-84, score-0.659]

41 A gamma process is a completely random measure over X with gamma marginals. [sent-88, score-0.709]

42 Specifically, it is a random atomic measure of the form (3), such that for each measurable subset A, the (random) mass G(A) is gamma distributed. [sent-89, score-0.464]

43 Each atom Xk is a choice item, with its mass wk > 0 corresponding to the desirability parameter. [sent-93, score-0.669]

44 The Thurstonian view described in the finite model can be easily extended to the nonparametric one, where a partial ranking (Xρ 1 . [sent-94, score-0.271]

45 Xρ m ) can be generated as the first m items to arrive in a race. [sent-97, score-0.279]

46 In particular, for each atom Xk let z k ∼ Exp(wk ) be the time of arrival of Xk and Xρ i the ith item to arrive. [sent-98, score-0.444]

47 $P\big((X_{\rho_1}, \ldots, X_{\rho_m}), (Z_i)_{i=1}^{m} \mid G\big) = \prod_{k \notin \{\rho_i\}_{i=1}^{m}} e^{-w_k z_{\rho_m}} \prod_{i=1}^{m} w_{\rho_i} e^{-w_{\rho_i} z_{\rho_i}} = \prod_{i=1}^{m} w_{\rho_i} \exp\big(-Z_i \big(\sum_{k=1}^{\infty} w_k - \sum_{j=1}^{i-1} w_{\rho_j}\big)\big)$ (7). Marginalizing out $(Z_i)_{i=1}^{m}$ gives the probability of $(X_{\rho_i})_{i=1}^{m}$ in (2). [sent-113, score-0.401]

48 $Z_1, \ldots, Z_m$ are mutually independent and exponentially distributed: $Z_i \mid (X_{\rho_i})_{i=1}^{m}, G \sim \mathrm{Exp}\big(\sum_{k=1}^{\infty} w_k - \sum_{j=1}^{i-1} w_{\rho_j}\big)$ (8). The above construction is depicted in Figure 1 (left). [sent-117, score-0.394]

49 2.3 Posterior characterization: Consider a number L of partial rankings, with the ℓth list denoted $Y_\ell = (Y_{\ell 1}, \ldots$ [sent-120, score-0.25]

50 Xρm ) consists of an ordered list of the atoms in G. [sent-127, score-0.292]

51 Y m ) is simply a list of observed choice items, which is why they were not expressed as an ordered list of atoms in G. [sent-131, score-0.447]

52 The task here is then to characterize the posterior law of G under a gamma process prior and supposing that the observed partial rankings were drawn iid from the nonparametric Plackett-Luce model given G. [sent-132, score-0.963]

53 Right: Visualization of top-5 rankings with rows corresponding to different rankings and columns to items sorted by size biased order. [sent-139, score-0.646]

54 For i = 1, …, m, introduce auxiliary variables $Z_{\ell i} \sim \mathrm{Exp}\big(G(X) - G(\{Y_{\ell 1}, \ldots, Y_{\ell, i-1}\})\big)$ (10). The joint probability of the item lists and auxiliary variables is then (c. [sent-149, score-0.421]

55 2, there is positive probability that an item appearing in a list Y appears in another list Y with = . [sent-155, score-0.481]

56 Denote the unique items among all L lists by $X_1^*, \ldots$ [sent-156, score-0.365]

57 , K, let $n_k$ be the number of occurrences of $X_k^*$ among the item lists. [sent-162, score-0.349]

58 $\delta_{\ell i k}$ is the indicator of the occurrence that item $X_k^*$ does not appear at a rank lower than i in the ℓth list. [sent-166, score-0.303]

59 Another application of the Palm formula now allows us to derive a posterior characterisation of G: Theorem 2. Given the observations and associated auxiliary variables $(Y_\ell, Z_\ell)_{\ell=1}^{L}$, the posterior law of G is also a gamma process, but with atoms with both fixed and random locations. [sent-171, score-0.823]

60 Specifically, $G \mid (Y_\ell, Z_\ell)_{\ell=1}^{L} = G^* + \sum_{k=1}^{K} w_k^* \delta_{X_k^*}$ (17), where $G^*$ and $w_1^*, \ldots$ [sent-172, score-0.338]

61 The law of $G^*$ is still a gamma process, $G^* \mid (Y_\ell, Z_\ell)_{\ell=1}^{L} \sim \Gamma(\alpha, \tau^*, H)$ with $\tau^* = \tau + \sum_{\ell}\sum_{i} Z_{\ell i}$ (18), while the masses have distributions $w_k^* \mid (Y_\ell, Z_\ell)_{\ell=1}^{L} \sim \mathrm{Gamma}\big(n_k,\; \tau + \sum_{\ell}\sum_{i} \delta_{\ell i k} Z_{\ell i}\big)$ (19). [sent-176, score-1.129]
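
A sketch of how the posterior of Theorem 2 could be simulated (illustrative only; it assumes H is a probability measure, so that the total mass of $G^*$ is Gamma(α, τ*) distributed, and it assumes the counts and auxiliary-variable sums have been accumulated as in the finite-model sweep above).

```python
import numpy as np

def sample_posterior_masses(n, delta_Z_sums, Z_total, alpha, tau, rng=None):
    """Draw the fixed-atom masses w_k* (eq. 19) and the total mass of G* (eq. 18).
    n[k]            : number of lists in which item X_k* appears (>= 1 for observed items)
    delta_Z_sums[k] : sum over lists/stages of delta_{l,i,k} * Z_{l,i}
    Z_total         : sum of all auxiliary variables Z_{l,i}"""
    rng = np.random.default_rng() if rng is None else rng
    tau_star = tau + Z_total
    w_star = rng.gamma(shape=np.asarray(n, dtype=float),
                       scale=1.0 / (tau + np.asarray(delta_Z_sums, dtype=float)))
    g_star_total = rng.gamma(shape=alpha, scale=1.0 / tau_star)   # total mass of G*
    return w_star, g_star_total
```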

62 This leaves the latent variables to consist of the masses w∗ , (wk ) and the auxiliary variables (Z i ). [sent-179, score-0.286]

63 Hyperparameters of the gamma process can be simply derived from the joint distribution in Theorem 1. [sent-181, score-0.366]

64 Since the marginal probability of the partial rankings is invariant to rescaling of the masses, it is sufficient to keep τ fixed at 1. [sent-182, score-0.321]

65 3 Dynamic Bayesian nonparametric ranking models: In this section we develop an extension of the Bayesian nonparametric Plackett-Luce model to time-varying rankings, where the rating parameters of items may change smoothly over time and be reflected in a changing series of rankings. [sent-184, score-0.587]

66 , we may model the rankings at time t using a gamma-process-distributed random measure $G_t$ as in Section 2.2, [sent-188, score-0.584]

67 with Markov dependence among the sequence of measures $(G_t)$ enabling dependence among the rankings over time. [sent-189, score-0.38]

68 3.1 Pitt-Walker dependence model: We will construct a dependent sequence $(G_t)$ which marginally follows a gamma process Γ(α, τ, H) using the construction of [13]. [sent-191, score-0.384]

69 Since $G_t$ is atomic, we can write it in the form $G_t = \sum_{k=1}^{\infty} w_{tk} \delta_{X_{tk}}$ (24). Define a random measure $C_t$ with conditional law $C_t = \sum_{k=1}^{\infty} c_{tk} \delta_{X_{tk}}$, with $c_{tk} \mid G_t \sim \mathrm{Poisson}(\phi_t w_{tk})$ (25), where $\phi_t > 0$ is a dependence parameter. [sent-193, score-0.915]

70 The conditional law of $G_{t+1}$ given $C_t$ is then $G_{t+1} = G_{t+1}^* + \sum_{k=1}^{\infty} w_{t+1,k}^* \delta_{X_{tk}}$ (26), where $G_{t+1}^*$ and $(w_{t+1,k}^*)_{k=1}^{\infty}$ are all mutually independent. [sent-196, score-0.391]

71 If t+1 the prior law of Gt is Γ(α, τ, H), the marginal law of Gt+1 will be Γ(α, τ, H) as well when both Gt and Ct are marginalized out, thus maintaining a form of stationarity. [sent-199, score-0.248]

72 Further, although we have described the process in order of increasing t, the joint law of Gt , Ct , Gt+1 can equivalently be described in the reverse order with the same conditional laws as above. [sent-200, score-0.177]

73 Note that if ctk = 0, the conditional distribution of wt+1,k will be degenerate at 0. [sent-201, score-0.245]

74 Hence Gt+1 has an atom at Xtk if and only if Ct has an atom at Xtk , that is, if ctk > 0. [sent-202, score-0.558]
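
To illustrate the construction, below is a one-step simulation sketch on a finite truncation of the gamma process. The specific conditionals used, $c_{tk} \sim \mathrm{Poisson}(\phi_t w_{tk})$ followed by $w_{t+1,k} \sim \mathrm{Gamma}(c_{tk}, \tau + \phi_t)$ for surviving atoms and fresh atoms with inverse scale $\tau + \phi_t$, are the standard Pitt-Walker choices that preserve the Γ(α, τ, H) marginal; the exact parameterization is my assumption rather than a quotation from the paper.

```python
import numpy as np

def pitt_walker_step(w_t, phi_t, alpha, tau, n_new_atoms=0, rng=None):
    """One step G_t -> C_t -> G_{t+1} applied to the atom masses of a truncated gamma process.
    Atoms with c_tk = 0 die; survivors get fresh Gamma(c_tk, tau + phi_t) masses."""
    rng = np.random.default_rng() if rng is None else rng
    w_t = np.asarray(w_t, dtype=float)
    c_t = rng.poisson(phi_t * w_t)                                  # thinning counts, cf. (25)
    w_next = np.zeros_like(w_t)
    alive = c_t > 0
    w_next[alive] = rng.gamma(shape=c_t[alive], scale=1.0 / (tau + phi_t))
    # Brand-new atoms standing in for the purely random part G*_{t+1}
    # (finite Gamma(alpha / n_new_atoms, tau + phi_t) truncation, for illustration only)
    w_fresh = (rng.gamma(shape=alpha / n_new_atoms, scale=1.0 / (tau + phi_t), size=n_new_atoms)
               if n_new_atoms > 0 else np.zeros(0))
    return w_next, w_fresh
```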

75 The probability that X is an atom in C2 with positive mass is 1 − exp(−φ1 w), in which case it has positive mass in G2 as well. [sent-208, score-0.303]

76 Conversely, once it is not an atom, it will never be an atom in the future since the base distribution H is non-atomic. [sent-209, score-0.173]

77 The lifetime of the atom is then the smallest t such that it is no longer an atom. [sent-210, score-0.222]

78 We can show by induction (details in supplementary material): Proposition 4. The probability that an atom X in $G_1$ with mass w > 0 is dead at time t is given by $P(G_t(\{X\}) = 0 \mid w) = \exp(-y_{t|1} w)$, where $y_{t|1}$ can be obtained from the recurrence $y_{t|t-1} = \phi_{t-1}$ and $y_{t|s-1} = \ldots$ [sent-211, score-0.249]

79 , Ytm ) (it trivially extends to multiple partial rankings of differing sizes). [sent-219, score-0.299]

80 We extend the results of the previous section in characterizing the posterior and developing a Gibbs sampler for the dynamical model. [sent-220, score-0.15]

81 , K be the set of unique items observed in ∗ Y1 , . [sent-225, score-0.23]

82 , $Y_T$, let $n_{tk} \in \{0, 1\}$ be the number of times the item $X_k^*$ appears at time t, and let $\rho_t$ be defined as $Y_t = (X_{\rho_{t1}}^*, \ldots$ [sent-228, score-0.195]

83 We write the masses of the fixed atoms as $w_{tk} = G_t(\{X_k^*\})$, while the total mass of all other random atoms is denoted $w_{t*} = G_t(X \setminus X^*)$. [sent-232, score-0.681]

84 Note that $w_{tk}$ has to be positive on a random contiguous interval of time that includes all observations of $X_k^*$ (its lifetime), but is zero outside of the interval. [sent-233, score-0.212]

85 We also write $c_{tk} = C_t(\{X_k^*\})$ and $c_{t*} = C_t(X \setminus X^*)$. [sent-234, score-0.212]

86 , m, latent variables $Z_{ti} \sim \mathrm{Exp}\big(w_{t*} + \sum_{k=1}^{K} w_{tk} - \sum_{j=1}^{i-1} w_{t,\rho_{tj}}\big)$ (30). Figure 2: Sample path drawn from the Dawson-Watanabe superprocess. [sent-241, score-0.213]

87 Conditioned on the latent variables (Zti ), (ctk ) and (ct∗ ), we update the masses (wtk ), which are independent and gamma distributed since all likelihoods are of gamma form. [sent-246, score-0.856]

88 Note that the total masses (Gt (X)) are not likelihood identifiable, so we introduce an extra step to improve mixing by sampling them from the prior (integrating out (ctk ), (ct∗ )), scaling all masses along with it. [sent-247, score-0.37]

89 We update α along with the random masses (wt∗ ) and (ct∗ ) efficiently using a forward-backward recursion. [sent-249, score-0.215]

90 When the time interval between ranking observations is not constant, it is desirable to work with dynamic models evolving over continuous-time instead, with the underlying random measures (Gt ) defined over all t ∈ R, but with observations at a discrete set of times t1 < t2 < · · · . [sent-253, score-0.182]

91 This is a diffusion on the space of measures with the gamma process Γ(α, τ, H) as its equilibrium distribution. [sent-255, score-0.415]

92 These consist of the weekly top-20 best-sellers list from June 2008 to April 2012 in various categories. [sent-266, score-0.169]

93 We consider here the categories paperback nonfiction (PN) and hardcover fiction (HF), for which respectively 249 and 916 books appear at least once in the top-20 lists over the 200 weeks. [sent-267, score-0.349]

94 In order to take into account the publication date of a book, we do not consider books in the likelihood before their first appearance in a list. [sent-269, score-0.144]

95 Our approach is based on the theory of atomic random measures, where we showed that the Plackett-Luce generative model corresponds exactly to a size-biased permutation of the atoms in the random measure. [sent-289, score-0.198]

96 We characterized the posterior distribution, and derived a simple MCMC sampling algorithm for posterior simulation. [sent-290, score-0.183]

97 Our approach can be seen as a multi-stage generalization of posterior inference in normalized random measures [21, 22, 23], and can be easily extended from gamma processes to general completely random measures. [sent-291, score-0.511]

98 In our experiments we found that our model is insufficient to capture the empirical observation that bestsellers often start off high on the lists and tail off afterwards, since our model has continuous sample paths. [sent-294, score-0.199]

99 We adjusted for this by simply not including books in the model prior to their publication date. [sent-295, score-0.147]

100 The Ornstein-Uhlenbeck Dirichlet process and other time-varying processes for Bayesian nonparametric inference. [sent-437, score-0.14]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gt', 0.355), ('wk', 0.338), ('gamma', 0.309), ('items', 0.23), ('ctk', 0.212), ('rankings', 0.208), ('item', 0.195), ('wtk', 0.19), ('ct', 0.176), ('masses', 0.173), ('atom', 0.173), ('xk', 0.146), ('atoms', 0.132), ('list', 0.132), ('xtk', 0.127), ('lists', 0.114), ('law', 0.112), ('gibbs', 0.11), ('nk', 0.109), ('nonparametric', 0.108), ('books', 0.096), ('ction', 0.093), ('partial', 0.091), ('ik', 0.088), ('bestsellers', 0.085), ('desirability', 0.081), ('posterior', 0.079), ('ranking', 0.072), ('sampler', 0.071), ('yt', 0.068), ('auxiliary', 0.067), ('atomic', 0.066), ('paperback', 0.063), ('bayesian', 0.057), ('mutually', 0.056), ('hardcover', 0.056), ('zti', 0.056), ('mass', 0.054), ('wt', 0.053), ('zw', 0.052), ('branching', 0.052), ('arrive', 0.049), ('lifetime', 0.049), ('measures', 0.044), ('rating', 0.044), ('ith', 0.044), ('dependence', 0.043), ('bestselling', 0.042), ('ethier', 0.042), ('gts', 0.042), ('superprocess', 0.042), ('thurstonian', 0.042), ('update', 0.042), ('exp', 0.041), ('book', 0.039), ('palm', 0.037), ('sport', 0.037), ('weekly', 0.037), ('measure', 0.035), ('grif', 0.034), ('durations', 0.034), ('conditional', 0.033), ('arrival', 0.032), ('caron', 0.032), ('gormley', 0.032), ('process', 0.032), ('normalized', 0.031), ('diffusion', 0.03), ('pool', 0.03), ('unobserved', 0.03), ('appeared', 0.03), ('non', 0.03), ('ordered', 0.028), ('publication', 0.027), ('characterization', 0.027), ('dw', 0.026), ('extension', 0.025), ('derived', 0.025), ('occurrences', 0.024), ('prior', 0.024), ('completely', 0.024), ('york', 0.024), ('inference', 0.024), ('conditionally', 0.024), ('variables', 0.023), ('choice', 0.023), ('dx', 0.023), ('probability', 0.022), ('observations', 0.022), ('nite', 0.022), ('ts', 0.022), ('dynamic', 0.022), ('suppose', 0.022), ('voting', 0.021), ('ratings', 0.021), ('united', 0.021), ('date', 0.021), ('among', 0.021), ('ranked', 0.021), ('appear', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 60 nips-2012-Bayesian nonparametric models for ranked data

Author: Francois Caron, Yee W. Teh

Abstract: We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We develop a time-varying extension of our model, and apply it to the New York Times lists of weekly bestselling books. 1

2 0.18894519 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

Author: Andriy Mnih, Yee W. Teh

Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user’s item selection process. In the interests of scalability, we restrict our attention to treestructured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. 1

3 0.17952155 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

Author: Nicholas Foti, Sinead Williamson

Abstract: A number of dependent nonparametric processes have been proposed to model non-stationary data with unknown latent dimensionality. However, the inference algorithms are often slow and unwieldy, and are in general highly specific to a given model formulation. In this paper, we describe a large class of dependent nonparametric processes, including several existing models, and present a slice sampler that allows efficient inference across this class of models. 1

4 0.17221382 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

Author: Thomas Furmston, David Barber

Abstract: Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being popular methods in this field. In this article we provide a unifying perspective of these two algorithms by showing that their search-directions in the parameter space are closely related to the search-direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative optimisation method for Markov Decision Processes. We are able to show that the algorithm has numerous desirable properties, absent in the naive application of Newton’s method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent. 1

5 0.16452946 59 nips-2012-Bayesian nonparametric models for bipartite graphs

Author: Francois Caron

Abstract: We develop a novel Bayesian nonparametric model for random bipartite graphs. The model is based on the theory of completely random measures and is able to handle a potentially infinite number of nodes. We show that the model has appealing properties and in particular it may exhibit a power-law behavior. We derive a posterior characterization, a generative process for network growth, and a simple Gibbs sampler for posterior simulation. Our model is shown to be well fitted to several real-world social networks. 1
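As a purely illustrative sketch of the kind of generative process this abstract describes, one can truncate the random measure to K atoms with positive weights and connect each node on one side of the graph to each atom through a latent Poisson count, declaring an edge whenever the count is positive. The link function, parameter names, and truncation below are assumptions for illustration, not the paper's exact construction.

import numpy as np

rng = np.random.default_rng(0)

K, N = 50, 20                                        # truncated number of items, number of users
w = rng.gamma(shape=0.2, scale=1.0, size=K)          # item weights (truncated random measure)
activity = rng.gamma(shape=1.0, scale=1.0, size=N)   # per-user activity parameters

# Edge present when a latent Poisson count is positive, i.e.
# P(edge between user i and item k) = 1 - exp(-activity_i * w_k).
counts = rng.poisson(np.outer(activity, w))
adjacency = (counts > 0).astype(int)                 # N x K bipartite adjacency matrix

print("average items per user:", adjacency.sum(axis=1).mean())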

6 0.15809371 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

7 0.14846376 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

8 0.13292243 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

9 0.13129604 165 nips-2012-Iterative ranking from pair-wise comparisons

10 0.12979063 300 nips-2012-Scalable nonconvex inexact proximal splitting

11 0.12440249 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

12 0.12271948 75 nips-2012-Collaborative Ranking With 17 Parameters

13 0.11896458 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

14 0.11148499 142 nips-2012-Generalization Bounds for Domain Adaptation

15 0.098525852 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

16 0.098301999 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

17 0.098220907 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

18 0.095938839 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering

19 0.093810864 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

20 0.092055328 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.194), (1, 0.01), (2, 0.067), (3, 0.058), (4, -0.14), (5, -0.106), (6, 0.041), (7, 0.064), (8, 0.234), (9, 0.227), (10, -0.186), (11, 0.114), (12, -0.071), (13, -0.1), (14, 0.001), (15, -0.117), (16, -0.028), (17, -0.035), (18, 0.003), (19, -0.032), (20, 0.088), (21, -0.007), (22, 0.061), (23, -0.015), (24, -0.037), (25, 0.014), (26, -0.025), (27, -0.069), (28, -0.094), (29, 0.052), (30, -0.044), (31, 0.126), (32, 0.068), (33, -0.12), (34, 0.031), (35, -0.001), (36, 0.092), (37, 0.03), (38, 0.008), (39, -0.107), (40, -0.158), (41, 0.019), (42, 0.083), (43, 0.01), (44, -0.083), (45, 0.007), (46, 0.002), (47, -0.09), (48, -0.063), (49, -0.094)]
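The topicId/topicWeight pairs above are the LSI representation of this paper, and the simValue rankings that follow are presumably produced by comparing such vectors across papers. The page does not state which similarity measure its generator uses, so the cosine-similarity sketch below is only an assumption about how a list like this could be computed.

import numpy as np

def cosine_similarity(u, v):
    # Cosine similarity between two topic-weight vectors.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def rank_similar(query_vec, corpus_vecs, top_m=20):
    # corpus_vecs: dict mapping paperId -> topic-weight vector of the same length.
    scores = {pid: cosine_similarity(query_vec, vec) for pid, vec in corpus_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_m]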

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96329433 60 nips-2012-Bayesian nonparametric models for ranked data

Author: Francois Caron, Yee W. Teh

Abstract: We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We develop a time-varying extension of our model, and apply it to the New York Times lists of weekly bestselling books. 1
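The Plackett-Luce model named in this abstract has a simple generative reading: items are chosen one at a time, without replacement, with probability proportional to their positive rating parameters. The sketch below samples a top-m list from a finite weight vector; it is an illustration only, since the nonparametric extension summarised here replaces the finite vector with an atomic measure drawn from a gamma process, which the sketch does not implement.

import numpy as np

def sample_top_m(weights, m, rng=None):
    # Sample a top-m partial ranking under the Plackett-Luce model: at each
    # stage, pick a not-yet-chosen item with probability proportional to its weight.
    rng = rng if rng is not None else np.random.default_rng()
    weights = np.asarray(weights, dtype=float)
    remaining = list(range(len(weights)))
    ranking = []
    for _ in range(m):
        w = weights[remaining]
        idx = rng.choice(len(remaining), p=w / w.sum())
        ranking.append(remaining.pop(idx))
    return ranking

# Example: six items with uneven ratings, sample a top-3 list.
print(sample_top_m([5.0, 2.0, 1.0, 0.5, 0.3, 0.1], m=3, rng=np.random.default_rng(1)))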

2 0.67480528 59 nips-2012-Bayesian nonparametric models for bipartite graphs

Author: Francois Caron

Abstract: We develop a novel Bayesian nonparametric model for random bipartite graphs. The model is based on the theory of completely random measures and is able to handle a potentially infinite number of nodes. We show that the model has appealing properties and in particular it may exhibit a power-law behavior. We derive a posterior characterization, a generative process for network growth, and a simple Gibbs sampler for posterior simulation. Our model is shown to be well fitted to several real-world social networks. 1

3 0.58928758 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

Author: Nicholas Foti, Sinead Williamson

Abstract: A number of dependent nonparametric processes have been proposed to model non-stationary data with unknown latent dimensionality. However, the inference algorithms are often slow and unwieldy, and are in general highly specific to a given model formulation. In this paper, we describe a large class of dependent nonparametric processes, including several existing models, and present a slice sampler that allows efficient inference across this class of models. 1

4 0.53404725 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes

Author: Dahua Lin, John W. Fisher

Abstract: Mixture distributions are often used to model complex data. In this paper, we develop a new method that jointly estimates mixture models over multiple data sets by exploiting the statistical dependencies between them. Specifically, we introduce a set of latent Dirichlet processes as sources of component models (atoms), and for each data set, we construct a nonparametric mixture model by combining sub-sampled versions of the latent DPs. Each mixture model may acquire atoms from different latent DPs, while each atom may be shared by multiple mixtures. This multi-to-multi association distinguishes the proposed method from previous ones that require the model structure to be a tree or a chain, allowing more flexible designs. We also derive a sampling algorithm that jointly infers the model parameters and present experiments on both document analysis and image modeling. 1

5 0.51308173 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

Author: Maksims Volkovs, Richard S. Zemel

Abstract: Bipartite matching problems characterize many situations, ranging from ranking in information retrieval to correspondence in vision. Exact inference in realworld applications of these problems is intractable, making efficient approximation methods essential for learning and inference. In this paper we propose a novel sequential matching sampler based on a generalization of the PlackettLuce model, which can effectively make large moves in the space of matchings. This allows the sampler to match the difficult target distributions common in these problems: highly multimodal distributions with well separated modes. We present experimental results with bipartite matching problems—ranking and image correspondence—which show that the sequential matching sampler efficiently approximates the target distribution, significantly outperforming other sampling approaches. 1
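As an illustrative sketch of the sequential, Plackett-Luce-style construction this abstract refers to, the proposal below builds a bipartite matching one row at a time, choosing an unassigned column with probability proportional to a positive score. The score matrix, the row ordering, and the function name are assumptions; the paper's actual sampler, including how proposals are scored and accepted, is more involved.

import numpy as np

def propose_matching(scores, rng=None):
    # scores: (n x n) array of positive weights for matching row i to column j.
    # Rows are processed in order; each picks an unassigned column with
    # probability proportional to its weight (a Plackett-Luce-style choice).
    rng = rng if rng is not None else np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    unassigned = list(range(n))
    matching, log_prob = [], 0.0
    for i in range(n):
        w = scores[i, unassigned]
        probs = w / w.sum()
        k = rng.choice(len(unassigned), p=probs)
        log_prob += float(np.log(probs[k]))
        matching.append((i, unassigned.pop(k)))
    return matching, log_prob   # log_prob is the proposal probability, useful in an MCMC correction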

6 0.49727011 75 nips-2012-Collaborative Ranking With 17 Parameters

7 0.47652707 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

8 0.45686519 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

9 0.44715881 288 nips-2012-Rational inference of relative preferences

10 0.44293168 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

11 0.43910146 244 nips-2012-Nonconvex Penalization Using Laplace Exponents and Concave Conjugates

12 0.43674392 286 nips-2012-Random Utility Theory for Social Choice

13 0.4226115 165 nips-2012-Iterative ranking from pair-wise comparisons

14 0.4183245 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering

15 0.41031903 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

16 0.40706232 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

17 0.39402375 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing

18 0.37666723 57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors

19 0.37160948 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

20 0.35945398 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.024), (21, 0.36), (38, 0.113), (42, 0.026), (53, 0.01), (54, 0.013), (55, 0.02), (74, 0.032), (76, 0.152), (80, 0.12), (92, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92760456 195 nips-2012-Learning visual motion in recurrent neural networks

Author: Marius Pachitariu, Maneesh Sahani

Abstract: We present a dynamic nonlinear generative model for visual motion based on a latent representation of binary-gated Gaussian variables. Trained on sequences of images, the model learns to represent different movement directions in different variables. We use an online approximate inference scheme that can be mapped to the dynamics of networks of neurons. Probed with drifting grating stimuli and moving bars of light, neurons in the model show patterns of responses analogous to those of direction-selective simple cells in primary visual cortex. Most model neurons also show speed tuning and respond equally well to a range of motion directions and speeds aligned to the constraint line of their respective preferred speed. We show how these computations are enabled by a specific pattern of recurrent connections learned by the model. 1

2 0.88391107 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

Author: Sanja Fidler, Sven Dickinson, Raquel Urtasun

Abstract: This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the state-of-the-art in both 2D [1] and 3D object detection [2]. 1

3 0.84526581 300 nips-2012-Scalable nonconvex inexact proximal splitting

Author: Suvrit Sra

Abstract: We study a class of large-scale, nonsmooth, and nonconvex optimization problems. In particular, we focus on nonconvex problems with composite objectives. This class includes the extensively studied class of convex composite objective problems as a subclass. To solve composite nonconvex problems we introduce a powerful new framework based on asymptotically nonvanishing errors, avoiding the common stronger assumption of vanishing errors. Within our new framework we derive both batch and incremental proximal splitting algorithms. To our knowledge, our work is the first to develop and analyze incremental nonconvex proximal-splitting algorithms, even if we were to disregard the ability to handle nonvanishing errors. We illustrate one instance of our general framework by showing an application to large-scale nonsmooth matrix factorization. 1
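A minimal sketch of the kind of update a proximal splitting method performs on a composite objective f(x) + g(x), here with an l1 regularizer whose proximal operator is soft-thresholding; the injected perturbation stands in for the non-vanishing computational error that the abstract's framework is designed to tolerate. The convex least-squares example and all names are illustrative assumptions, not the paper's algorithm.

import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of tau * ||x||_1.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def inexact_prox_gradient(grad_f, x0, step, lam, n_iters, error_scale=1e-3, seed=0):
    # Inexact proximal-gradient iteration: the gradient is perturbed by a
    # bounded, non-vanishing error, mimicking inexact computation.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        noisy_grad = grad_f(x) + error_scale * rng.standard_normal(x.shape)
        x = soft_threshold(x - step * noisy_grad, step * lam)
    return x

# Example: l1-regularized least squares, min_x 0.5*||Ax - b||^2 + lam*||x||_1.
A = np.random.default_rng(2).standard_normal((30, 10))
b = A @ np.array([1.0, -2.0] + [0.0] * 8)
x_hat = inexact_prox_gradient(lambda x: A.T @ (A @ x - b), np.zeros(10),
                              step=1.0 / np.linalg.norm(A, 2) ** 2, lam=0.1, n_iters=200)
print(np.round(x_hat, 2))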

same-paper 4 0.84383923 60 nips-2012-Bayesian nonparametric models for ranked data

Author: Francois Caron, Yee W. Teh

Abstract: We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We develop a time-varying extension of our model, and apply it to the New York Times lists of weekly bestselling books. 1

5 0.81724077 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

Author: Christoph H. Lampert

Abstract: We study the problem of maximum marginal prediction (MMP) in probabilistic graphical models, a task that occurs, for example, as the Bayes optimal decision rule under a Hamming loss. MMP is typically performed as a two-stage procedure: one estimates each variable’s marginal probability and then forms a prediction from the states of maximal probability. In this work we propose a simple yet effective technique for accelerating MMP when inference is sampling-based: instead of the above two-stage procedure we directly estimate the posterior probability of each decision variable. This allows us to identify the point in time at which we are sufficiently certain about any individual decision. Whenever this is the case, we dynamically prune the variables we are confident about from the underlying factor graph. Consequently, at any time only samples of variables whose decision is still uncertain need to be created. Experiments in two prototypical scenarios, multi-label classification and image inpainting, show that adaptive sampling can drastically accelerate MMP without sacrificing prediction accuracy. 1
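The pruning idea in this abstract can be illustrated with a generic sampling loop: estimate each binary decision's posterior from samples, and freeze (prune) a variable once its estimate is sufficiently certain, so later samples only need to cover the still-uncertain variables. The sampler interface, the confidence rule, and the toy demo below are illustrative assumptions rather than the paper's factor-graph machinery.

import numpy as np

def mmp_with_pruning(draw_sample, n_vars, max_samples=10000, min_samples=50, conf=0.99):
    # draw_sample(fixed) returns a length-n_vars array of 0/1 states, where
    # `fixed` maps already-pruned variable indices to their frozen decision.
    counts = np.zeros(n_vars)   # how often each active variable was 1
    totals = np.zeros(n_vars)   # how many samples each active variable received
    fixed = {}
    for _ in range(max_samples):
        sample = draw_sample(fixed)
        for i in [j for j in range(n_vars) if j not in fixed]:
            counts[i] += sample[i]
            totals[i] += 1
            p = counts[i] / totals[i]          # crude estimate of P(decision_i = 1)
            if totals[i] >= min_samples and max(p, 1.0 - p) >= conf:
                fixed[i] = int(p >= 0.5)       # confident: prune this variable
        if len(fixed) == n_vars:
            break
    return np.array([fixed.get(i, int(counts[i] * 2 >= totals[i])) for i in range(n_vars)])

# Toy demo: independent binary variables with known marginals; the easy
# variables (0.9 and 0.2) are pruned quickly, the 0.55 one keeps being sampled.
p_true = np.array([0.9, 0.2, 0.55])
demo_rng = np.random.default_rng(3)
print(mmp_with_pruning(lambda fixed: (demo_rng.random(3) < p_true).astype(int), 3))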

6 0.77964789 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

7 0.7751171 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

8 0.70539427 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release

9 0.70393538 23 nips-2012-A lattice filter model of the visual pathway

10 0.68315089 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding

11 0.66653264 190 nips-2012-Learning optimal spike-based representations

12 0.65320969 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

13 0.64953184 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints

14 0.64793521 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

15 0.64539534 94 nips-2012-Delay Compensation with Dynamical Synapses

16 0.64303571 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

17 0.63658756 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

18 0.63290703 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter

19 0.63280809 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

20 0.62857974 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex